GSoC 2: Adding the hilbert distance to Dask-GeoPandas to spatially partition GeoDataFrames

July 5, 2021 GSoc Dask-GeoPandas Spatial partitioning

Following on from the previous blogpost, I am working on adding spatial partitioning to Dask-GeoPandas, as part of Google’s Summer of Code. The key challenge of partitioning spatial data is that geometries are two-dimensional, requiring them to be transformed and stored as a one-dimensional data structure. One way to achieve this is to use space filling curves to transform high dimensional objects to be expressed, stored and retrieved as a one-dimensional data structure.

Space filling curves are lines that pass through every point in space, in a particular order and pass through points once only, so that each point lies a unique distance along the curve. They have many applications from big data analysis and visualization, image decomposition, computer graphics and geographic information systems. Common space filling curves include the Hilbert and Morton curve. The range of both curves contain the entire 2-dimensional unit square, which means that the Hilbert and Morton curves can handle projected coordinates.

The basic premise behind using SFCs for partitioning spatial data is:

First subdivide the space into logic clusters of “close” points
Given the coordinates of a point (x, y) within a unit square and the distance d along the curve when it reaches that point, then points that have nearby d values will also have nearby (x, y) values.
This means that we do not have to scan the whole list of points to find ones that are geographically close, but just a few either side of the starting point.

Different space-filling curves define closeness using different shape patterns and various parameters, including: the cluster order, distance metrics, and the pattern shape. The shapes and the parameters of these curves will determine how the spatial data is partitioned. The choice of the shape and parameters must be determined on a case-to-case basis, including the structure of the data and the intended application.

Since SpatialPandas Scipy 2020 talk noted that the hilbert curve was considerably faster than Geohash, I will start there. Technically, the Geohash is not a space filling curve, but that will be discussed in the next blog post.

Hilbert curve

The Hilbert curve contains the entire 2-dimensional object, which means that for every point within a square, we can map it onto a point on the (one-dimensional) curve and back again through encoding and decoding. The hilbert curve is also fractal and self-similar. If you examine the six iterations of the Hilbert curve construct below, you can see that the curve is built recursively at each successive approximation for that curve - known as an order:

Figure: First 6 orders of the Hilbert curve construction

How is this implemented?

The Dask-GeoPandas spatial_shuffle calls all three spatial partitioning methods:

Firstly, the mid-point of the bounds for each geometry is calculated The midpoint is simple math based on bounds, the centroid is more complex based on calculating the “center of mass” of the geometry (varies by geometry type), and “representative point” (point on surface) is even more complex since it needs to ensure the result intersects the polygon. This simple assumption does speed up the calculation of the hilbert distance, but is not precise for large polygon/linestring geometries.
Secondly, calculate the hilbert distance for each partition. The hilbert distance is calculated as the distance of the mid point along the curve, where the bounds of the curve is defined as the total bounds.
Thirdly, the spatial_shuffle method then sets the index using the calculated hilbert distances for each of the npartitions selected.
Finally, the spatial partitions are calculated by default using the convex hull method.

One aspect that I haven’t discussed is the hilbert p value. By default, the p value is 16. Values closer to 1 have smaller precision, whereas larger values have higher precision.

Next week

I initially implemented the hilbert distance using a pure numpy version #70, but was later swapped with bit shifting, as used by the Morton distance. In the next blog post I will discuss the progress of both the hilbert distance and the Geohash implementation.