GSoC 4: Adding the Morton distance to spatially partition Dask-GeoPandas GeoDataFrames
August 2, 2021 GSoC Dask-GeoPandas Spatial partitioning
Last week
In the last blog post I discussed the updates to the Geohash implementation. This is now largely complete and I plan to open a PR for it sometime early next week. Following what was discussed in the meeting on 22/07/2021, I went away and decided to also compute the Morton curve as a more generalisable alternative to the Geohash. I still think the Geohash method could be useful.
As a recap, the goal of spatial partitioning is to encode multi-dimensional values into one-dimensional space whilst preserving “closeness” (which can be defined in several ways). In the context of geospatial data, closeness is generally interpreted as geographic closeness: points that are close in geographic space tend to be more similar than points further away. If we were instead to order coordinates linearly, the ordering would depend on either $x$ or $y$ alone, so “closeness” would only be sensitive to one of the two values.
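To make this concrete, the Morton (Z-order) curve achieves this encoding by interleaving the bits of the discretised $x$ and $y$ coordinates into a single integer. The following is a minimal illustrative sketch, not the Dask-GeoPandas implementation; in practice the coordinates would first be rescaled onto the integer grid $[0, 2^p)$ using the total bounds of the data.

def interleave_bits(x, y, p):
    # Bit i of x goes to position 2*i, bit i of y to position 2*i + 1
    z = 0
    for i in range(p):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

interleave_bits(2, 3, 5)  # returns 14
interleave_bits(3, 3, 5)  # returns 15 - a neighbouring cell gets a neighbouring code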
This week
This week I have worked on computing the Morton curve as a second Space-Filling Curve (SFC) to be used for spatially partitioning Dask-GeoPandas GeoDataFrames. Following a discussion about the Geohash, it was decided to simplify it by again generalising (Multi)LineStrings and (Multi)Polygons through the use of mid-points, as sketched below. This allows us to compare the three spatial partitioning methods directly. The generalisation does not consider the geometric detail of more complex structures, but for the purpose of (basic) spatial partitioning it is a useful baseline to build on. Handling that detail is now outside the scope of Google Summer of Code, but it will form a nice follow-up PR.
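As a rough sketch of what this mid-point generalisation might look like (illustrative only, assuming a GeoDataFrame gdf such as the one loaded below, and using the bounding box rather than the exact Dask-GeoPandas internals):

# Represent each (Multi)LineString or (Multi)Polygon by the
# mid-point of its bounding box
bounds = gdf.geometry.bounds
mid_x = (bounds["minx"] + bounds["maxx"]) / 2
mid_y = (bounds["miny"] + bounds["maxy"]) / 2

These mid-points can then be fed to any of the three encodings, which is what makes the comparison fair.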
import geopandas
import dask_geopandas
import pandas as pd
import numpy as np
from hilbertcurve.hilbertcurve import HilbertCurve
import matplotlib.pyplot as plt
p = 5  # order (number of iterations) of the SFC
gdf = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
d_gdf = dask_geopandas.from_geopandas(gdf, npartitions=4)
From some initial testing, there is very little difference in compute time between the two SFCs. Whilst the Hilbert distance makes use of Numba, no Numba acceleration has been used in computing the Morton distance because the underlying bit interleaving is much simpler.
%timeit d_gdf.hilbert_distance(p).compute()
33.1 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit d_gdf.morton_distance(p).compute()
33.1 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If we now consider the spatial distribution of the partitions, we can see that both methods produce similar patterns, although in this case the Hilbert distance appears to preserve locality better.
gdf["hilbert_distance"] = h_distance.to_frame().astype(str)
gdf["morton_distance"] = m_distance.to_frame().astype(str)
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,20))
gdf.plot(ax=ax1, column='hilbert_distance', categorical=True)
gdf.plot(ax=ax2, column='morton_distance', categorical=True)
fig.tight_layout()
ax1.set_title('Hilbert distance', size=16)
ax2.set_title('Morton distance', size=16)
ax1.set_axis_off()
ax2.set_axis_off();
Final week
During the final week, I plan to spend most of the time making sure each of the partitioning methods is well documented. This will allow users of the Dask-GeoPandas package to apply the methods and understand the differences between them. This will generally take the form of making sure the docstrings are readable and creating a Jupyter notebook.
In the last meeting, Brendan raised a really useful point about measuring the performance of each spatial partitioning method. This will need to be completed to see which of the three methods is most performant in different situations. Comparing the two SFCs is relatively simple: using a metric such as the summed distance between consecutive points along the curve, the Hilbert curve preserves locality better, since the biggest jumps along the Z-order curve are quite large, occurring where the curve crosses between quadrants of the grid. Bringing the Geohash into the comparison is a little more challenging. An initial thought was to measure how much the partitions spatially overlap one another, where the method with the least overlap is best suited; this could be calculated simply by measuring the overlapping area between partitions. More robust metrics include PySAL's completeness() function, which measures the completeness of polygon $a$ with respect to polygon $b$, giving a value of 1 when $a$ is fully contained within $b$. We could also measure the shape/compactness of each spatial partition, for example using the minimum_bounding_circle_ratio function.
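As a sketch of the summed-distance idea (assuming the h_distance and m_distance series computed above; this is an exploratory snippet rather than a finalised benchmark):

def total_jump(gdf, sfc_distance):
    # Reorder rows by their position along the curve, then sum the
    # straight-line distances between consecutive centroids - a
    # smaller total suggests better locality preservation
    ordered = gdf.loc[sfc_distance.sort_values().index]
    x = ordered.geometry.centroid.x.to_numpy()
    y = ordered.geometry.centroid.y.to_numpy()
    return np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2).sum()

total_jump(gdf, h_distance)  # Hilbert ordering
total_jump(gdf, m_distance)  # Morton ordering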