At Leaf, we're constantly striving to push the boundaries of agricultural data and to make it easy for our customers to build their data pipelines and applications on the best foundation we can provide.
Today, we're excited to announce a significant upgrade to our technology stack: the transition from GeoJSON to GeoParquet. This move not only reinforces our commitment to leveraging cutting-edge technology but also brings a host of benefits to our customers, including the ability to retrieve data from Leaf in GeoParquet format.
What is GeoParquet?
GeoParquet is an open standard for encoding geospatial data in the Parquet format. Parquet, developed by Cloudera and Twitter, is a columnar storage file format optimized for large-scale data processing.
GeoParquet extends Parquet's capabilities to handle geospatial data efficiently, making it ideal for big data applications in the geospatial field. Its columnar format allows for high compression rates and faster query performance, providing better storage efficiency and processing speed compared to traditional formats like GeoJSON.
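To make the row-versus-column distinction concrete, here is a toy sketch using only the Python standard library. It is not Parquet's actual on-disk format, and the field names and values are made up for illustration. It shows two things: a query over one attribute only needs to touch that one column, and a column layout avoids repeating every field name for every record.

```python
import json
from statistics import mean

# Hypothetical records: field names and values are invented for illustration.
rows = [
    {
        "id": i,
        "crop": "corn",
        "yield_t_ha": 9.0 + (i % 5) * 0.1,
        "geometry": f"POINT ({-93.0 + i * 0.001:.3f} 42.000)",
    }
    for i in range(1000)
]

# Row-oriented layout (GeoJSON-like): one object per feature.
# Column-oriented layout (Parquet-like): one list per field.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# The row layout has to walk every record to read a single attribute;
# the column layout reads one contiguous list and ignores the rest.
mean_from_rows = mean(row["yield_t_ha"] for row in rows)
mean_from_columns = mean(columns["yield_t_ha"])

# Field names are stored once per column instead of once per record.
row_bytes = len(json.dumps(rows))
column_bytes = len(json.dumps(columns))
print(f"row layout: {row_bytes} bytes, column layout: {column_bytes} bytes")
```

Real Parquet files go further by applying per-column encoding and compression, which tends to work well because values within one column are similar to each other.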
Piloting with GeoParquet
As with any big infrastructure or pipeline change, we try to test out new technologies at small scale and via pilot projects to understand how they work, how to work with them, and what the pros and cons will be of adopting them. Here’s what we’ve found:
Pros |
---|
Efficient Storage: GeoParquet uses columnar storage, which can be more space-efficient than row-based formats. |
Performance: Parquet is optimized for query performance, since a query can read only the columns it needs. |
Compression: Built-in support for various compression algorithms reduces storage requirements. |
Schema Evolution: Parquet supports schema evolution, allowing changes to the data schema without significant overhead. |
Interoperability: Widely supported across various big data tools and platforms. |
Cons |
---|
Complexity: Requires more sophisticated tools and libraries for reading and writing compared to simpler formats like GeoJSON. |
Overhead: For smaller datasets, the overhead of converting to and from Parquet may not be justified by the performance benefits. |
Limited Support: Some traditional GIS tools may not support GeoParquet natively. |
Building with GeoParquet and Sedona
Of the tests we ran, one in particular is worth sharing in case others are facing a similar decision. Our move to GeoParquet was driven by the need for faster processing and better handling of large datasets, so we ran extensive tests comparing GeoJSON-to-GeoParquet conversion with two different approaches: Python and Sedona.
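For readers who want to run a similar comparison, here is a minimal sketch of a measurement harness. It is not the harness used for the numbers in this post; the function names are illustrative, and the GeoPandas-based converter is only an assumption about what a "Python" conversion path could look like. Note that `tracemalloc` only tracks memory allocated through Python, so memory used by native libraries would need a process-level measurement instead.

```python
import time
import tracemalloc
from statistics import mean


def measure(convert, *args):
    """Return (elapsed seconds, peak traced memory in MB) for one convert(*args) call."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        convert(*args)
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return elapsed, peak_bytes / 1e6


def geojson_to_geoparquet(src_path, dst_path):
    """Assumed converter: requires geopandas and pyarrow to be installed."""
    import geopandas as gpd  # imported lazily so the harness works without it

    gpd.read_file(src_path).to_parquet(dst_path)


def summarize(convert, jobs):
    """Run convert over (src, dst) jobs; return (mean seconds, mean peak MB)."""
    results = [measure(convert, *job) for job in jobs]
    return mean(t for t, _ in results), mean(m for _, m in results)
```

A call such as `summarize(geojson_to_geoparquet, [("field.geojson", "field.parquet")])` (file names hypothetical) would produce the two means for one converter; swapping in a different converter function gives the other side of the comparison.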
Setup for Comparing GeoParquet in Python vs. Sedona
Python | |
---|---|
Mean Conversion Time - The time taken to convert GeoJSON files to GeoParquet format. | 8.86 seconds |
Mean Memory Consumption - The amount of memory used during the conversion process. | 119.80 MB |
Output File Size - The size of the resulting GeoParquet files. | File sizes ranged from 0.04 to 9.15 MB, with a mean of 2.95 MB. |
Sedona | |
---|---|
Mean Conversion Time - The time taken to convert GeoJSON files to GeoParquet format. | 2.21 seconds |
Mean Memory Consumption - The amount of memory used during the conversion process. | 29.73 MB |
Output File Size - The size of the resulting GeoParquet files. | File sizes ranged from 0.03 to 5.12 MB, with a mean of 1.67 MB. |
For us, the results are clear and underscore the value of using Sedona for GeoParquet conversions. On average, Sedona converted files about four times faster than Python (2.21 vs. 8.86 seconds), used about a quarter of the memory (29.73 vs. 119.80 MB), and produced smaller output files (a mean of 1.67 vs. 2.95 MB). Sedona's faster conversion times and lower, more consistent memory consumption make it a superior choice for handling large and variable datasets. This improvement means we can process high-resolution as-planted, as-applied, and harvested files without performance issues, providing accurate and timely insights for our customers.
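The size of the gap can be checked with quick arithmetic on the means reported in the tables above. The dictionary keys below are just labels chosen for this sketch.

```python
# Mean values copied from the Python and Sedona tables in this post.
python_means = {"conversion_s": 8.86, "memory_mb": 119.80, "file_mb": 2.95}
sedona_means = {"conversion_s": 2.21, "memory_mb": 29.73, "file_mb": 1.67}


def ratios(baseline, candidate):
    """Return baseline / candidate for every metric in baseline."""
    return {metric: baseline[metric] / candidate[metric] for metric in baseline}


python_vs_sedona = ratios(python_means, sedona_means)
for metric, ratio in python_vs_sedona.items():
    print(f"{metric}: Python mean is {ratio:.2f}x the Sedona mean")
# conversion_s: Python mean is 4.01x the Sedona mean
# memory_mb: Python mean is 4.03x the Sedona mean
# file_mb: Python mean is 1.77x the Sedona mean
```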
Available now: Export data in GeoParquet format from Leaf
With our internal move to GeoParquet, we're excited to make this format available to our customers. If you would like GeoParquet enabled for your account, please contact us.
Once enabled, GeoParquet files will be available for new data using the /geoparquet endpoints.
With this configuration enabled, customers can retrieve GeoParquet files containing the same data, benefiting from faster processing and reduced storage costs.
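As a rough sketch of what a client call might look like, the snippet below builds an authenticated request for a `/geoparquet` endpoint using only the Python standard library. Everything specific here is an assumption made for illustration: the base URL, the resource path, the idea that `/geoparquet` is appended to a resource path, and the bearer-token header. The developer documentation is the authority on the real URLs, authentication, and response format.

```python
import urllib.request


def build_geoparquet_request(base_url, resource_path, token):
    """Build a GET request for a hypothetical <resource>/geoparquet endpoint.

    base_url, resource_path, and the Bearer header are placeholders,
    not confirmed details of the Leaf API.
    """
    url = f"{base_url.rstrip('/')}/{resource_path.strip('/')}/geoparquet"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


# Placeholder values only; replace with the ones from the developer documentation.
request = build_geoparquet_request(
    "https://api.example.com/v1", "files/FILE_ID", "YOUR_TOKEN"
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; how the response delivers the file (directly or via a download link) is not covered in this post.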
For more information and detailed instructions on enabling GeoParquet, please visit our developer documentation.