How Good Are Multi-dimensional Learned Indices? An Experimental Survey

Qiyu Liu, Maocheng Li, Yuxiang Zeng, Yanyan Shen, Lei Chen

TL;DR

The paper addresses the lack of a unified, rigorous evaluation of multi-dimensional learned indices. It classifies existing approaches into projection-, augmentation-, and grid-based families, and implements six representative indices under a single experimental framework with real and synthetic datasets. The study finds that projection- and grid-based learned indices can substantially reduce index size and accelerate range queries, but kNN queries and dynamic updates remain challenging, with no single method beating traditional spatial indices in all scenarios. These findings guide future design toward improved updateability, broader query support, and hardware-conscious optimizations for practical deployment.

Abstract

Efficient indexing is fundamental for multi-dimensional data management and analytics. An emerging trend is to learn the storage layout of multi-dimensional data directly with simple machine learning models, yielding the concept of the learned index. Compared with the conventional indices used for decades (e.g., kd-tree and R-tree variants), learned indices are empirically shown to be both space- and time-efficient on modern architectures. However, a comprehensive evaluation of existing multi-dimensional learned indices under a unified benchmark is still lacking, which makes it difficult to choose a suitable index for specific data and queries and hinders the deployment of learned indices in real application scenarios. In this paper, we present the first in-depth empirical study to answer the question of how good multi-dimensional learned indices are. Six recently published indices are evaluated under a unified experimental configuration including index implementation, datasets, query workloads, and evaluation metrics. We thoroughly investigate the evaluation results and discuss the findings that may provide insights for future learned index design.

Paper Structure (27 sections, 4 equations, 17 figures, 8 tables)

This paper contains 27 sections, 4 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Illustration of limited comparison in previous studies.
  • Figure 2: Workflow of projection-based indices.
  • Figure 3: Illustration of the projection function of ML-Index DBLP:conf/edbt/DavitkovaM020 where the $k$-means centers are used as reference points.
  • Figure 4: Illustration of the projection function of LISA DBLP:conf/sigmod/Li0ZY020 based on a $3\times3$ grid partition. Note, the Lebesgue measure in 2-D space is the area of a rectangular region.
  • Figure 5: Illustration of IF-Index DBLP:conf/vldb/0001KH20 where dim is the selected sorting dimension, $\Delta$ is the maximum prediction error, and $a$, $b$ are the slope and intercept of the linear model.
  • ...and 12 more figures
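
The Figure 5 caption names three quantities: a slope $a$, an intercept $b$, and a maximum prediction error $\Delta$. The following is a minimal one-dimensional sketch of how such an error-bounded linear model can locate a key in a sorted array. It is not the IF-Index implementation: the function names (`fit_linear`, `max_error`, `lookup`), the least-squares fit, and the rounding rule are all assumptions made for illustration.

```python
import bisect


def fit_linear(keys):
    """Least-squares fit of position ~ a * key + b over a sorted key list (assumed fitting rule)."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2
    var_x = sum((x - mean_x) ** 2 for x in keys)
    if var_x == 0:
        # All keys are equal: fall back to a constant prediction.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
    a = cov / var_x
    b = mean_y - a * mean_x
    return a, b


def max_error(keys, a, b):
    """Delta: the largest gap between a true position and its predicted position."""
    return max(abs(i - round(a * x + b)) for i, x in enumerate(keys))


def lookup(keys, a, b, delta, key):
    """Predict a position, then binary-search only the window [pred - delta, pred + delta]."""
    n = len(keys)
    pred = round(a * key + b)
    lo = min(max(0, pred - delta), n)
    hi = max(lo, min(n, pred + delta + 1))
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < n and keys[i] == key else None


# Hypothetical data: the values of one sorting dimension, already sorted.
keys = [1, 3, 4, 7, 10, 15, 21, 22, 30]
a, b = fit_linear(keys)
delta = max_error(keys, a, b)
found = [lookup(keys, a, b, delta, k) for k in keys]
```

Because `delta` is computed over every stored key, each stored key is guaranteed to fall inside its search window, so the bounded binary search never misses it. A smaller `delta` means a narrower window and fewer comparisons per lookup.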

Theorems & Definitions (3)

  • Definition: Point
  • Definition: Range Query
  • Definition: $k$-Nearest Neighbor Query
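
This listing gives only the names of the three definitions, not their text. As a hedged illustration of the two query types, the sketch below gives brute-force reference implementations under assumed, standard semantics: a point is a tuple of coordinates, a range query returns the points inside an axis-aligned hyper-rectangle with inclusive bounds, and a $k$-nearest neighbor query uses Euclidean distance. The function names `range_query` and `knn_query` are hypothetical, and the paper's formal definitions may differ in detail.

```python
import heapq
import math


def range_query(points, lower, upper):
    """Return all points p with lower[i] <= p[i] <= upper[i] in every dimension i."""
    return [
        p for p in points
        if all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))
    ]


def knn_query(points, q, k):
    """Return the k points closest to q under Euclidean distance (assumed metric)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))


# Hypothetical 2-D data set.
points = [(0, 0), (1, 1), (2, 2), (5, 5)]
in_box = range_query(points, (0.5, 0.5), (2, 2))
nearest = knn_query(points, (0, 0), 2)
```

Linear scans like these are useful as a correctness oracle: the result of any index, learned or traditional, can be checked against them on small inputs.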