GSoC 5: Summarising the project

August 20, 2021    GSoc Dask-GeoPandas Spatial partitioning

Wrapping it up

In this blogpost I will provide a summary of the Google Summer of Code project (Dask bridge to scale geospatial analysis) I have been working on over the past 10 weeks. Before summarising this project, I would like to thank my mentors (Joris Bossche, Martin Fleischmann, Stefanie Lumnitz & Brendan Ward) for their continuous guidance and direction during this project. This was my first open source project and their help has been invaluable. If we ever meet in person, I owe you a beer or two!

Timeline

Early on during the project, I created a simple Gantt chart on Google Sheets to create a timeline to monitor and evaluate progress. This helped not only me but also my mentors keep track of progress and to modify the general outline of the project to reflect what was feasible during the limited time. This was broken down into main 5 components (with included pull requests);

  1. Project planning & maintenance
  2. Spatial partitioning based on Hilbert distance (the distance of the geometry mid points along the Hilbert curve) - #PR70
  3. Spatial partitioning based on Morton distance (the distance of the geometry mid points along the Z-order/Morton curve) #PR90
  4. Spatial partitioning based on Geohash (encoded using the distance of the geometry mid points ) #PR93
  5. Spatial shuffle - #PR104

The original proposal (on Google Documents) included both spatial partitioning and indexing but once the project started, we decided to focus primarily on the spatial partitioning. I ended up implementing several spatial partitioning methods, where each of them has their own advantages - see discussion below comparing methods.

Breakdown of each PR submitted

Hilbert distance #PR70 - status: merged

  • Refactor Hilbert distance based on SpatialPandas implementation. This calculated the distance of geometry mid points along the Hilbert curve.

Morton distance #PR90 - status: merged

  • Calculated the distance of geometry mid points along the Morton curve distance based on Sean Anderson’s Bit Twiddling Hacks - #http://graphics.stanford.edu/~seander/bithacks.html#InterleaveBMN

Geohash #PR93 - status: open

  • Partitioned Dask-GeoPandas GeoDataFrames/GeoSeries initially using pygeohash package
  • Added numpy version of Geohash encode based on deprecated neathgeohash package.
  • Added raw argument to geohash function to byte or unicode encoding

Spatial shuffle - #104

  • To spatially shuffle Dask-GeoPandas objects using either one of the 3 methods (hilbert, morton and geohash) or a user defined column.

Comparing spatial partitioning methods

I created the Dask-GeoPandas docs notebook, showing how to use spatial_shuffle to partition Dask-GeoDataFrames using the three methods (Hilbert distance, Morton distance and Geohash). Although the notebook provides detail on the how to spatially partition GeoDataFrames, it doesn’t provide examples on which tool is the best job. While there is no one-fit-all tool because of the different mathematical properties of each partitioning method, I will provide some use case studies to highlight their use cases further. This will include looking at:

  1. Speed - this goal is pretty straight forward

    import pandas as pd
    import geopandas
    import dask_geopandas
    
    gdf = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres') )
    gdf = pd.concat([gdf] * 1000, axis=0)
    ddf = dask_geopandas.from_geopandas(gdf, npartitions=4)
    
    %timeit ddf.hilbert_distance(p=12).compute()
    167 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit ddf.morton_distance(p=12).compute()
    50.6 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit ddf.geohash(precision=12).compute()
    63.8 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
  2. Relationships between partitions - specifically the relationship the geometries of spatial partitions.

convex hull of the spatial partitions overlap.

  • Areal overlap calculation
  • esda’s completeness function which was rafactored recently by @ljw from nowosad and stepinski - https://doi.org/10.1080/13658816.2018.1511794. This measures the completeness of the partitions of polygons in a to those in b - where a value closer to 1 represents all polygons in a being fully contained by polygons in b.

Follow-up issues

There are several follow up issues that need to be addressed after this period. This includes;

Wrapping up Google Summer of Code

I have learnt a lot and I plan on continue contributing to Dask-GeoPandas and other open-source projects. I have discussed with my mentors the follow-up issues that need to be addressed and I will continue to add further functionality in the following weeks/months to come!

...