Wrapping it up

In this blogpost I will provide a summary of the Google Summer of Code project (Dask bridge to scale geospatial analysis) I have been working on over the past 10 weeks. Before summarising this project, I would like to thank my mentors (Joris Bossche, Martin Fleischmann, Stefanie Lumnitz & Brendan Ward) for their continuous guidance and direction during this project. This was my first open source project and their help has been invaluable. If we ever meet in person, I owe you a beer or two!

Timeline

Early on during the project, I created a simple Gantt chart on Google Sheets to create a timeline to monitor and evaluate progress. This helped not only me but also my mentors keep track of progress and to modify the general outline of the project to reflect what was feasible during the limited time. This was broken down into main 5 components (with included pull requests);

Project planning & maintenance
Spatial partitioning based on Hilbert distance (the distance of the geometry mid points along the Hilbert curve) - #PR70
Spatial partitioning based on Morton distance (the distance of the geometry mid points along the Z-order/Morton curve) #PR90
Spatial partitioning based on Geohash (encoded using the distance of the geometry mid points ) #PR93
Spatial shuffle - #PR104

The original proposal (on Google Documents) included both spatial partitioning and indexing but once the project started, we decided to focus primarily on the spatial partitioning. I ended up implementing several spatial partitioning methods, where each of them has their own advantages - see discussion below comparing methods.

Breakdown of each PR submitted

Hilbert distance #PR70 - status: merged

Refactor Hilbert distance based on SpatialPandas implementation. This calculated the distance of geometry mid points along the Hilbert curve.

Morton distance #PR90 - status: merged

Calculated the distance of geometry mid points along the Morton curve distance based on Sean Anderson’s Bit Twiddling Hacks - #http://graphics.stanford.edu/~seander/bithacks.html#InterleaveBMN

Geohash #PR93 - status: open

Partitioned Dask-GeoPandas GeoDataFrames/GeoSeries initially using pygeohash package
Added numpy version of Geohash encode based on deprecated neathgeohash package.
Added raw argument to geohash function to byte or unicode encoding

Spatial shuffle - #104

To spatially shuffle Dask-GeoPandas objects using either one of the 3 methods (hilbert, morton and geohash) or a user defined column.

Comparing spatial partitioning methods

I created the Dask-GeoPandas docs notebook, showing how to use spatial_shuffle to partition Dask-GeoDataFrames using the three methods (Hilbert distance, Morton distance and Geohash). Although the notebook provides detail on the how to spatially partition GeoDataFrames, it doesn’t provide examples on which tool is the best job. While there is no one-fit-all tool because of the different mathematical properties of each partitioning method, I will provide some use case studies to highlight their use cases further. This will include looking at:

Speed - this goal is pretty straight forward

import pandas as pd
import geopandas
import dask_geopandas

gdf = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres') )
gdf = pd.concat([gdf] * 1000, axis=0)
ddf = dask_geopandas.from_geopandas(gdf, npartitions=4)

%timeit ddf.hilbert_distance(p=12).compute()
167 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ddf.morton_distance(p=12).compute()
50.6 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit ddf.geohash(precision=12).compute()
63.8 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Relationships between partitions - specifically the relationship the geometries of spatial partitions.

convex hull of the spatial partitions overlap.

Areal overlap calculation
esda’s completeness function which was rafactored recently by @ljw from nowosad and stepinski - https://doi.org/10.1080/13658816.2018.1511794. This measures the completeness of the partitions of polygons in a to those in b - where a value closer to 1 represents all polygons in a being fully contained by polygons in b.

Follow-up issues

There are several follow up issues that need to be addressed after this period. This includes;

Issue created for Hilbert Distance #85
Issue created for Morton distance #92
Issue created for Geohash #95
Another follow up issue is to create another PR request, which includes the draft jupyter notebook that explains spatial partitioning and provides some use cases to compare each of the spatial partitioning methods further. This will include comparing different precisions and number of partitions.

Wrapping up Google Summer of Code

I have learnt a lot and I plan on continue contributing to Dask-GeoPandas and other open-source projects. I have discussed with my mentors the follow-up issues that need to be addressed and I will continue to add further functionality in the following weeks/months to come!

GSoC 5: Summarising the project

August 20, 2021 GSoc Dask-GeoPandas Spatial partitioning

Wrapping it up

Timeline

Breakdown of each PR submitted

Hilbert distance #PR70 - status: merged

Morton distance #PR90 - status: merged

Geohash #PR93 - status: open

Spatial shuffle - #104

Comparing spatial partitioning methods

Follow-up issues

Wrapping up Google Summer of Code

Related Posts

August 2, 2021

GSoC 4: Adding the Morton distance to spatially partition Dask-GeoPandas GeoDataFrames

July 19, 2021

GSoC 3: Adding Geohash to Dask-GeoPandas spatially partition GeoDataFrames

July 5, 2021

GSoC 2: Adding the hilbert distance to Dask-GeoPandas to spatially partition GeoDataFrames