GSoC 1: Describing the project: Adding spatial partitioning to Dask-GeoPandas
June 20, 2021 GSoC Dask-GeoPandas Spatial partitioning
About the project
Earlier this year I was accepted into Google Summer of Code 2021 to create a GeoPandas-Dask bridge for scaling geospatial analysis, supported by NumFOCUS (numfocus.org). This 10-week program gives me the opportunity to develop software skills alongside some of the key GeoPandas developers.
Dask-GeoPandas is a relatively young project that aims to overcome the scalability limitations of GeoPandas by using Dask as a bridge. GeoPandas is the most popular Python library for handling geographic problems and data, but all computation must be done in memory. Dask is an open-source Python library that enables working with arbitrarily large datasets and can dramatically speed up computation. Dask provides an API familiar from the PyData stack (pandas, scikit-learn, NumPy) and overcomes the scalability limits of these tools, which were not designed for datasets larger than memory. It achieves this by partitioning a large dataset into smaller chunks and applying functions to each chunk independently.
As noted in the previous blog post, the key challenge of partitioning spatial data is that it is multi-dimensional, which requires a different approach from one-dimensional data. Dask-GeoPandas currently partitions datasets without regard to location, which means that every possible candidate in a spatial query must be evaluated. Spatially partitioning the data would be a great improvement for Dask-GeoPandas, for both IO and spatial operations: by only considering candidates whose spatial partitions overlap one another, the time complexity is reduced considerably.
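A toy illustration of the idea (not Dask-GeoPandas code): if each partition records the bounding box of its geometries, a spatial query only needs to inspect partitions whose boxes overlap the query region, and the rest are skipped without touching any individual geometry. The partition bounds and query box below are made up for the example.

```python
def boxes_overlap(a, b):
    """Test overlap of two axis-aligned boxes given as (minx, miny, maxx, maxy)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# Hypothetical bounding boxes, one per partition of geometries.
partition_bounds = [
    (0, 0, 10, 10),
    (10, 0, 20, 10),
    (0, 10, 10, 20),
]

query_box = (12, 2, 15, 5)

# Only partitions whose bounds overlap the query are candidates.
candidates = [i for i, bounds in enumerate(partition_bounds)
              if boxes_overlap(bounds, query_box)]
print(candidates)  # → [1]: two of the three partitions are pruned
```

This is the same pruning that a spatial index performs per geometry, lifted to the partition level, so whole chunks can be skipped before any per-element work happens.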
Spatial partitioning in Python isn’t new; SpatialPandas has already implemented it. The next blog post will cover what spatial partitioning is, and I will then cover common techniques.