Re: Extremely slow performance with shapefile and geopackage


davideps@...
 

Thank you Sean and René. When I load only 1,000 features and save as a shapefile, it writes quickly and correctly (i.e. it is loadable in QGIS). However, I get a new error when I save to GeoPackage. I think I need to recompile SQLite with R*Tree support, right? Do I do this directly with a C compiler, or is it something I can do indirectly via conda or pip? Here is the error:

>>> geodata.to_file(data_dir+"data.gpkg", driver="GPKG")
Traceback (most recent call last):
  File "fiona/_err.pyx", line 201, in fiona._err.GDALErrCtxManager.__exit__
fiona._err.CPLE_AppDefinedError: b'sqlite3_exec(CREATE VIRTUAL TABLE "rtree_data_geom" USING rtree(id, minx, maxx, miny, maxy)) failed: no such module: rtree'
Exception ignored in: 'fiona._shim.gdal_flush_cache'
Traceback (most recent call last):
  File "fiona/_err.pyx", line 201, in fiona._err.GDALErrCtxManager.__exit__
fiona._err.CPLE_AppDefinedError: b'sqlite3_exec(CREATE VIRTUAL TABLE "rtree_data_geom" USING rtree(id, minx, maxx, miny, maxy)) failed: no such module: rtree'
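One way to probe whether an SQLite build includes the R*Tree module is to try creating an rtree virtual table (a sketch only: note that GDAL typically links its own libsqlite3, so checking Python's sqlite3 module is merely indicative, not proof about the library GDAL uses):

```python
import sqlite3

def has_rtree(con=None):
    """Return True if the SQLite library behind this connection was
    compiled with the R*Tree module. Caveat: GDAL may be linked against
    a different libsqlite3 than Python's sqlite3 module."""
    con = con or sqlite3.connect(":memory:")
    try:
        con.execute(
            'CREATE VIRTUAL TABLE temp.rtree_probe '
            'USING rtree(id, minx, maxx, miny, maxy)'
        )
        return True
    except sqlite3.OperationalError:
        return False
```

If the check fails for the SQLite that GDAL links against, reinstalling GDAL/Fiona from a channel whose SQLite enables R*Tree (e.g. conda-forge) is usually easier than recompiling SQLite yourself.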


On Thu, Nov 14, 2019 at 7:11 PM Sean Gillies <sean.gillies@...> wrote:
René, this is all good advice.

In the past, saving to a geopackage could be especially slow because of the overhead of transactions. Every Collection write() would happen within its own transaction. Since version 1.8 of Fiona, calling writerecords() uses a default transaction size of 20,000 features (see https://fiona.readthedocs.io/en/latest/README.html#a1-2017-11-06) and is much faster.

Geopandas has used the faster method since https://github.com/geopandas/geopandas/issues/557#issuecomment-332202764, so you may want to check whether a geopandas upgrade improves your situation.
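The approach Sean describes can be sketched as follows (illustrative only: the path, schema, and records are placeholders, and the function assumes Fiona >= 1.8):

```python
def write_gpkg(path, schema, crs, records):
    """Sketch: pass all features to a single writerecords() call so that
    Fiona (>= 1.8) can batch them into large transactions (20,000
    features per transaction by default), instead of calling write()
    per feature, which used to open one transaction per feature."""
    import fiona  # imported lazily; requires Fiona >= 1.8

    with fiona.open(path, "w", driver="GPKG", crs=crs, schema=schema) as dst:
        # One writerecords() call instead of a write() loop: this is where
        # the transaction batching (and the speedup) comes from.
        dst.writerecords(records)
```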

On Thu, Nov 14, 2019 at 8:33 AM René Buffat <buffat@...> wrote:
Hi David

First, I would check whether you have enough RAM available; if not, that could explain the slow performance. In that case, I would recommend reading, processing, and writing the data in batches.
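The batching idea can be sketched with a small chunking helper; the commented read/process/write calls are hypothetical stand-ins for whatever I/O you actually use:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`,
    so only one batch needs to be held in memory at a time."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Illustrative use (read_rows, process, and write_rows are hypothetical):
# for rows in batched(read_rows("big.shp"), 1000):
#     write_rows(process(rows))
```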

Otherwise, there are many parameters that can affect performance: how complex the geometries are, how many rows you want to write, how many parallel reads and writes hit the disk, and so on.

Regarding geometry problems, I'm not entirely sure what you mean. Regardless, with big datasets it's always a good idea to debug with a smaller subset (e.g. the first thousand rows) and check that everything works.

And on an unrelated note, I would recommend using os.path.join(data_dir, "data.gpkg") instead of data_dir + "data.gpkg".

Kind regards,
René



--
Sean Gillies
