Extremely slow performance with shapefile and GeoPackage


Sean Gillies
 

David,

I am not a geopackage expert, but yes, it looks like your sqlite3 library lacks the rtree extension.

In the future, please include this kind of error message, along with some details about the source of your software (conda? pip?), when you ask for help. That will help us get to the heart of the matter and prevent speculation about things like transaction sizes.
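
[Editor's note: a quick way to check whether the SQLite linked into Python's own sqlite3 module has the rtree extension is the same CREATE VIRTUAL TABLE statement from the traceback below. This is only a rough indicator, since GDAL (which Fiona uses for GeoPackage I/O) may link a different SQLite than Python does.]

```python
import sqlite3

# Probe for the rtree extension (SQLITE_ENABLE_RTREE) in the SQLite
# library that Python's sqlite3 module is linked against.
con = sqlite3.connect(":memory:")
try:
    con.execute(
        'CREATE VIRTUAL TABLE demo_idx USING rtree(id, minx, maxx, miny, maxy)'
    )
    print("rtree: available")
except sqlite3.OperationalError as exc:
    print(f"rtree: missing ({exc})")
finally:
    con.close()
```

If it is missing, reinstalling GDAL/Fiona from a distribution whose SQLite enables rtree (e.g. the conda-forge channel) is usually much easier than recompiling SQLite yourself.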


On Sun, Nov 17, 2019 at 2:33 AM David Epstein <davideps@...> wrote:
Thank you Sean and René. When I load only 1,000 features and save as a shapefile, it writes quickly and correctly (i.e. it is loadable in QGIS). However, I get a new error when I save to GeoPackage. I think I need to recompile SQLite with rtree support, right? Do I do this directly with a C compiler, or is it something I can do indirectly via conda or pip? Here is the error:

>>> geodata.to_file(data_dir+"data.gpkg", driver="GPKG")
Traceback (most recent call last):
  File "fiona/_err.pyx", line 201, in fiona._err.GDALErrCtxManager.__exit__
fiona._err.CPLE_AppDefinedError: b'sqlite3_exec(CREATE VIRTUAL TABLE "rtree_data_geom" USING rtree(id, minx, maxx, miny, maxy)) failed: no such module: rtree'
Exception ignored in: 'fiona._shim.gdal_flush_cache'
Traceback (most recent call last):
  File "fiona/_err.pyx", line 201, in fiona._err.GDALErrCtxManager.__exit__
fiona._err.CPLE_AppDefinedError: b'sqlite3_exec(CREATE VIRTUAL TABLE "rtree_data_geom" USING rtree(id, minx, maxx, miny, maxy)) failed: no such module: rtree'

On Thu, Nov 14, 2019 at 7:11 PM Sean Gillies <sean.gillies@...> wrote:
René, this is all good advice.

In the past, saving to a GeoPackage could be especially slow because of transaction overhead: every Collection write() happened within its own transaction. Since version 1.8 of Fiona, calling writerecords() uses a default transaction size of 20,000 features (see https://fiona.readthedocs.io/en/latest/README.html#a1-2017-11-06) and is much faster.

Geopandas uses the faster method since https://github.com/geopandas/geopandas/issues/557#issuecomment-332202764. You may want to check to see if a geopandas upgrade improves your situation.
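
[Editor's note: the transaction overhead Sean describes can be illustrated with the plain sqlite3 standard-library module, since a GeoPackage is a SQLite database underneath. This sketch does not use Fiona; the table and rows are invented for the demonstration.]

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path, isolation_level=None)  # autocommit mode
con.execute("CREATE TABLE features (id INTEGER, wkt TEXT)")
rows = [(i, "POINT(0 0)") for i in range(2000)]

t0 = time.perf_counter()
for row in rows:  # one implicit transaction per INSERT
    con.execute("INSERT INTO features VALUES (?, ?)", row)
per_row = time.perf_counter() - t0

con.execute("DELETE FROM features")
t0 = time.perf_counter()
con.execute("BEGIN")  # one explicit transaction for the whole batch
con.executemany("INSERT INTO features VALUES (?, ?)", rows)
con.execute("COMMIT")
batched = time.perf_counter() - t0
con.close()
print(f"per-row commits: {per_row:.3f}s, single transaction: {batched:.3f}s")
```

Committing once per row forces a journal flush per feature; batching thousands of inserts into one transaction amortizes that cost, which is essentially what writerecords() does.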

On Thu, Nov 14, 2019 at 8:33 AM René Buffat <buffat@...> wrote:
Hi David

First, I would check whether you have a sufficient amount of RAM available; if not, that could explain the slow performance.
In that case, I would recommend reading, processing, and writing the data in batches.
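
[Editor's note: a minimal sketch of batch processing with only the standard library; the file name and the per-batch step are hypothetical.]

```python
import csv
import itertools

def batches(rows, size):
    """Yield lists of up to `size` rows from any iterable of rows."""
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical usage: stream a large CSV without holding it all in RAM.
# with open("data.csv", newline="") as f:
#     for batch in batches(csv.DictReader(f), 100_000):
#         process_and_write(batch)  # hypothetical per-batch step
```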

Otherwise, there are many parameters that can affect performance, e.g. how complex the geometries are, how many rows you want to write, how many parallel reads and writes hit the disk, etc.

Regarding geometry problems, I'm not entirely sure what you mean. But regardless, with big datasets it's always a good idea to debug with a smaller dataset (e.g. the first thousand rows) and then test whether everything works.

And, fully unrelated: I would recommend using os.path.join(data_dir, "data.gpkg") instead of data_dir + "data.gpkg".
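
[Editor's note: the pitfall this avoids, shown with a hypothetical path.]

```python
import os

data_dir = "/home/david/geodata"  # hypothetical path, no trailing slash

# Plain concatenation silently glues the names together...
print(data_dir + "data.gpkg")               # /home/david/geodatadata.gpkg
# ...while os.path.join inserts the separator only where needed.
print(os.path.join(data_dir, "data.gpkg"))  # /home/david/geodata/data.gpkg
```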

Best regards, rene



--
Sean Gillies



--
Sean Gillies



davideps@...
 

Hi folks,

I loaded a 5GB CSV file, turned lat/long into points, and have tried to save as a shapefile and as a GeoPackage. At first, I ran into problems with datetime and tuple columns, which I've since converted to str or dropped entirely. However, after an hour I still was not able to save. Lines like:
geodata.to_file(filename=data_dir + "data.shp", driver="ESRI Shapefile")
geodata.to_file(data_dir + "data.gpkg", driver="GPKG")
seem to hang, with empty files written to disk. When I kill the line in the interpreter, sometimes data gets written at the last minute, sometimes not. Are there common dataframe or geometry problems I should check first?

-david