LiLIS: Enhancing Big Spatial Data Processing with Lightweight Distributed Learned Index

Zhongpu Chen, Wanjun Hao, Ziang Zeng, Long Shi, Yi Wen, Zhi-Jie Wang, Yu Zhao

TL;DR

LiLIS tackles the substantial index construction and query overhead observed in distributed big spatial analytics. It introduces a spline-based, error-bounded learned index with a spatial-aware partitioner that fits into Apache Spark without engine refactoring, enabling efficient point, range, kNN, and join queries. Empirical results show LiLIS delivers 2-3 orders of magnitude faster queries and 1.5-2× faster index construction compared with state-of-the-art systems like Sedona, across real-world and synthetic datasets. The work demonstrates the viability of distributed learned indices for big spatial data and suggests future directions toward alternative models and broader query support.

Abstract

The efficient management of big spatial data is crucial for location-based services, particularly in smart cities. However, existing systems such as Simba and Sedona, which incorporate distributed spatial indexing, still incur substantial index construction overheads, rendering them far from optimal for real-time analytics. Recent studies demonstrate that learned indices can achieve high efficiency through well-designed machine learning models, but how to design a learned index for distributed spatial analytics remains unaddressed. In this paper, we present LiLIS, a Lightweight Distributed Learned Index for big spatial data. LiLIS combines machine-learned search strategies with spatial-aware partitioning within a distributed framework, and efficiently implements common spatial queries, including point query, range query, k-nearest neighbors (kNN), and spatial joins. Extensive experimental results over real-world and synthetic datasets show that LiLIS outperforms state-of-the-art big spatial data analytics systems by 2-3 orders of magnitude for most spatial queries, and that index building achieves a $1.5$-$2\times$ speed-up. The code is available at https://github.com/SWUFE-DB-Group/learned-index-spark.

Paper Structure

This paper contains 28 sections, 3 equations, 9 figures, 4 tables, 3 algorithms.

Figures (9)

  • Figure 1: An illustration of a learned index in which the key is a spatial coordinate $(x, y)$ with a value $v$, and a model maps the key to the position in memory/disk directly with an error bound.
  • Figure 2: The architecture of LiLIS on Apache Spark. It introduces spatial RDD with learned index, and supports a wide range of spatial queries.
  • Figure 3: The main idea of the spline index is to obtain an estimated position $\hat{p}$ of a given key $k$ by interpolating between two adjacent spline points (Step 1), and then perform a binary search within positions $\hat{p} \pm \epsilon$ in the sorted dataset (Step 2). For a fixed error bound $\epsilon$, the search time is constant once the lower bound $k_l$ and upper bound $k_r$ have been retrieved.
  • Figure 4: The overall performance under default settings.
  • Figure 5: The throughput (jobs per minute) of point and range queries under default settings.
  • ...and 4 more figures
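
The two-step spline lookup described in the Figure 3 caption can be sketched in Python as follows. This is a minimal single-machine illustration over one-dimensional, strictly increasing sort keys. The function names `build_spline` and `lookup`, the naive greedy knot selection, and the choice of a single sort key are assumptions made for illustration and are not taken from the paper's implementation.

```python
import math
from bisect import bisect_left


def _fits(keys, lo, hi, epsilon):
    """True if interpolating between keys[lo] and keys[hi] predicts every
    intermediate position with an absolute error of at most epsilon."""
    k0, k1 = keys[lo], keys[hi]
    for i in range(lo + 1, hi):
        est = lo + (keys[i] - k0) / (k1 - k0) * (hi - lo)
        if abs(est - i) > epsilon:
            return False
    return True


def build_spline(keys, epsilon):
    """Select spline knots (key, position) from sorted, strictly increasing keys.

    Naive greedy selection (an assumption for this sketch): each segment is
    extended one key at a time until the error bound would be violated.
    """
    n = len(keys)
    knots = [(keys[0], 0)]
    start = 0
    while start < n - 1:
        end = start + 1  # a segment with no intermediate keys always fits
        while end + 1 < n and _fits(keys, start, end + 1, epsilon):
            end += 1
        knots.append((keys[end], end))
        start = end
    return knots


def lookup(keys, knots, epsilon, key):
    """Return the position of key in keys, or -1 if it is absent."""
    if key < keys[0] or key > keys[-1]:
        return -1
    # A real index would cache this list instead of rebuilding it per query.
    knot_keys = [k for k, _ in knots]
    r = bisect_left(knot_keys, key)
    if knot_keys[r] == key:
        return knots[r][1]
    # Step 1: interpolate between the adjacent knots k_l < key < k_r.
    (kl, pl), (kr, pr) = knots[r - 1], knots[r]
    est = pl + (key - kl) / (kr - kl) * (pr - pl)
    # Step 2: binary search restricted to positions est +/- epsilon.
    lo = max(0, math.floor(est - epsilon))
    hi = min(len(keys), math.floor(est + epsilon) + 1)
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1
```

With a fixed `epsilon`, Step 2 inspects a window of at most about `2 * epsilon + 1` positions regardless of the dataset size, which is why the caption treats it as constant. Locating the adjacent knots is done here with a plain binary search over the knot keys; other structures could serve that step.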