GSoC 3: Adding Geohash to Dask-GeoPandas to spatially partition GeoDataFrames

July 19, 2021    GSoC Dask-GeoPandas Spatial partitioning

Last week

In the last blog post I introduced the Hilbert curve as one technique for spatially partitioning a Dask GeoDataFrame. Since then, I have also added tests to check that the Hilbert distance implementation is valid, using the hilbertcurve package as the benchmark. Hopefully all the changes made to the Hilbert distance pull request will be approved and merged into master.

This week

The geohash (www.geohash.org) is a convenient, global, hierarchical grid-based method for encoding coordinates (x, y) into a one-dimensional string drawn from a base-32 alphabet of digits and letters. Grid cells at the same hierarchical level do not overlap, which makes this simple technique useful for indexing spatial point features (indexing lines and polygons is possible but more difficult), and it suffers no performance degradation as the number of features grows. The length of the string is directly tied to the precision of the encoded coordinates: each additional character in the geohash adds precision. Additionally, the longer the shared prefix between two geohashes, the closer they are spatially, so proximity searches can be done on the prefix alone, without performing expensive spatial operations. This makes geohashes well suited for indexing mechanisms, for grouping geographically located entities, and for compactly encoding coordinates at the level of detail required. However, only non-projected (geographic) coordinates can be encoded (e.g. EPSG:4326). There are also boundary problems: -180 and 180 indicate the same place, but they have completely different geohash codes. Finally, because the Earth is an oblate spheroid, geohash cell dimensions deviate sharply as latitude increases, so distance calculations near the poles are less accurate:

| geohash length | lat bits | lng bits | lat error (°) | lng error (°) | km error |
|---|---|---|---|---|---|
| 1 | 2 | 3 | ±23 | ±23 | ±2500 |
| 2 | 5 | 5 | ±2.8 | ±5.6 | ±630 |
| 3 | 7 | 8 | ±0.70 | ±0.70 | ±78 |
| 4 | 10 | 10 | ±0.087 | ±0.18 | ±20 |
| 5 | 12 | 13 | ±0.022 | ±0.022 | ±2.4 |
| 6 | 15 | 15 | ±0.0027 | ±0.0055 | ±0.61 |
| 7 | 17 | 18 | ±0.00068 | ±0.00068 | ±0.076 |
| 8 | 20 | 20 | ±0.000085 | ±0.00017 | ±0.019 |

Table: Geohash length versus latitude/longitude bits, the resulting latitude and longitude error in degrees, and the overall error in km
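
The error columns follow directly from the bit counts: each character contributes 5 bits, split alternately between longitude and latitude (longitude takes the extra bit when the total is odd), and the error is half the cell size after that many halvings of the full ±180°/±90° ranges. A quick sanity check in Python (`geohash_error` is just an illustrative helper, not part of the implementation):

```python
# Sanity check of the table above: 5 bits per geohash character, alternating
# longitude, latitude, longitude, ...; the error is half the final cell size.
def geohash_error(length):
    total_bits = 5 * length
    lng_bits = (total_bits + 1) // 2   # longitude gets the extra bit for odd totals
    lat_bits = total_bits // 2
    lat_error = 180.0 / 2 ** lat_bits / 2
    lng_error = 360.0 / 2 ** lng_bits / 2
    return lat_bits, lng_bits, lat_error, lng_error

print(geohash_error(4))  # (10, 10, ~0.088, ~0.176) -> the length-4 row
print(geohash_error(5))  # (12, 13, ~0.022, ~0.022) -> the length-5 row
```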

Geohash typically uses a 32-character alphabet to encode coordinates (x, y) into strings where points close to each other have similar values. A prefix always covers the area represented by any longer code that extends it. For example, wb12x comprises five characters, and it contains the grid of wb12xabcd: wb12x represents a larger grid cell and wb12xabcd a smaller one nested inside it. Like floating-point numbers, geohashes carry some error depending on how many bits they have, so a longer geohash string has a smaller error. This means that you can't technically decode a geohash to a specific point, but you can usually approximate one by taking the midpoint of the error box. Encoding works by repeatedly bisecting the longitude and latitude ranges and recording which half the point falls in; decoding is similar, but instead of bisecting based on midpoint comparisons, you go backwards, setting either the min or max of the range to the midpoint according to the bits of the geohash.
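
To make the bisection idea concrete, here is a minimal, non-vectorised sketch of encoding and decoding. It is purely illustrative (the implementation I am working on is a vectorised numpy version), but it shows the prefix behaviour described above:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base-32 alphabet

def encode(lat, lng, length=5):
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True  # bits alternate: lng, lat, lng, lat, ...
    while len(bits) < 5 * length:
        rng, val = (lng_range, lng) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid   # point lies in the upper half
        else:
            bits.append(0)
            rng[1] = mid   # point lies in the lower half
        even = not even
    chars = []
    for i in range(0, len(bits), 5):   # pack 5 bits per base-32 character
        idx = 0
        for b in bits[i:i + 5]:
            idx = (idx << 1) | b
        chars.append(BASE32[idx])
    return "".join(chars)

def decode(gh):
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    even = True
    for c in gh:
        idx = BASE32.index(c)
        for shift in range(4, -1, -1):   # unpack 5 bits, most significant first
            rng = lng_range if even else lat_range
            mid = (rng[0] + rng[1]) / 2
            if (idx >> shift) & 1:
                rng[0] = mid
            else:
                rng[1] = mid
            even = not even
    # approximate the point by the midpoint of the remaining error box
    return (sum(lat_range) / 2, sum(lng_range) / 2)

print(encode(57.64911, 10.40744, length=5))  # 'u4pru'
print(encode(57.64911, 10.40744, length=9))  # 'u4pruydqq' -- shares the 'u4pru' prefix
print(decode("u4pru"))                       # roughly (57.6, 10.4), within the length-5 error box
```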

Initial work

I began by vectorising the neathgeohash implementation using numpy. I then took the same approach as before to spatially partition Dask-GeoPandas GeoDataFrames. I found that, when using a metric such as the summed distance between consecutive points along the curve, the Hilbert curve preserves locality better. The jumps between blocks, however, are quite large. I think the geohash is useful in scenarios where the geohash is precomputed and the coordinates are not projected.
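
As a sketch of that metric (not the exact code from my comparison): order the points by a spatial key and sum the Euclidean distances between consecutive points; the smaller the total, the better the ordering preserves locality. The keys in the toy example below (the raw x coordinate and a random permutation) are placeholders just to show that the metric distinguishes orderings; in the real comparison the keys would be the Hilbert distances and the geohash codes.

```python
import numpy as np

def summed_path_length(x, y, keys):
    """Total Euclidean distance travelled when visiting the points in key order."""
    order = np.argsort(keys)
    dx = np.diff(x[order])
    dy = np.diff(y[order])
    return np.sqrt(dx ** 2 + dy ** 2).sum()

# Toy check: visiting points sorted by a spatial key (here simply the x
# coordinate) travels less overall than visiting them in a random order.
rng = np.random.default_rng(42)
x = rng.uniform(-180, 180, 10_000)
y = rng.uniform(-90, 90, 10_000)
print(summed_path_length(x, y, x))                        # ordered west to east
print(summed_path_length(x, y, rng.permutation(10_000)))  # random order: noticeably larger
```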

<!-- Integrating GeoRaptor - which starts at the highest Geohash level and iterates through each level to find the best combination of geohashes across various levels. This reduces the data size for large polygons - improving speed and performance. ElasticSearch uses the same strategy for polygon indexing: https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-shape.html -->

...