Why Are Learned Indexes So Effective but Sometimes Ineffective?

Qiyu Liu; Siyuan Han; Yanlin Qi; Jingshu Peng; Jin Li; Longlong Lin; Lei Chen

TL;DR

This work reexamines the PGM-Index, a learned index based on error-bounded piecewise linear approximation. It proves a sub-logarithmic lookup bound of $O(\log\log N)$ under linear space $O(N/G)$, where $G$ denotes the segment coverage, and it diagnoses why the index often underperforms in practice: lookups are memory-bound, and the internal error-bounded searches dominate the cost. It introduces PGM++, a simple and effective enhancement that combines a hybrid search strategy with automated parameter tuning to optimize the space-time trade-off. Across real and synthetic workloads, PGM++ achieves speedups of up to $2.31\times$ over the original PGM-Index and up to $1.56\times$ over state-of-the-art learned indexes at comparable space, establishing a new Pareto frontier. The paper also provides a workload-independent cost model and tuning methodology that improves robustness and applicability in practical DBMS settings.

Abstract

Learned indexes have attracted significant research interest due to their ability to offer better space-time trade-offs compared to traditional B+-tree variants. Among various learned indexes, the PGM-Index, based on error-bounded piecewise linear approximation, is an elegant data structure that has demonstrated provably superior performance over conventional B+-tree indexes. In this paper, we explore two interesting research questions regarding the PGM-Index: (a) Why are PGM-Indexes theoretically effective? and (b) Why do PGM-Indexes underperform in practice? For question (a), we first prove that, for a set of $N$ sorted keys, the PGM-Index can, with high probability, achieve a lookup time of $O(\log\log N)$ while using $O(N)$ space. To the best of our knowledge, this is the tightest bound for learned indexes to date. For question (b), we identify that querying PGM-Indexes is highly memory-bound, where the internal error-bounded search operations often become the bottleneck. To close the performance gap, we propose PGM++, a simple yet effective extension to the original PGM-Index that employs a mixture of different search strategies, with hyper-parameters automatically tuned through a calibrated cost model. Extensive experiments on real workloads demonstrate that PGM++ establishes a new Pareto frontier. At comparable space costs, PGM++ speeds up index lookup queries by up to $2.31\times$ and $1.56\times$ when compared to the original PGM-Index and state-of-the-art learned indexes, respectively.

Paper Structure (18 sections, 6 theorems, 15 equations, 14 figures, 7 tables)

Introduction
Preliminaries
Learned Index
Existing Theoretical Results
Microbenchmark Setting
Why Are PGM-Indexes So Effective?
Motivation Experiments
Theoretical Analysis
Case Study: Uniform Keys
Why Are PGM-Indexes Ineffective?
PGM++: Optimization to PGM-Index
Hybrid Search Strategy
Calibrated Cost Model
Experimental Study
Overall Evaluation
...and 3 more sections

Key Result

theorem 1

Given a consecutive chunk of $2\epsilon+1$ sorted keys $\{k_i,\cdots,k_{i+2\epsilon}\}\subseteq\mathcal{K}$, there exists a horizontal line segment $\ell(x)=i+\epsilon$ such that $|\ell(k_j)-j|\leq\epsilon$ holds for $j=i,\cdots,i+2\epsilon$, implying that each line segment in an $\epsilon$-PLA can
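The horizontal-segment claim in the theorem above can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names `horizontal_segment` and `max_error` are hypothetical, and the random key set is only an illustrative assumption. It evaluates the constant function $\ell(x)=i+\epsilon$ on every chunk of $2\epsilon+1$ consecutive keys and records the largest deviation $|\ell(k_j)-j|$.

```python
import random


def horizontal_segment(i, eps):
    """Return the constant function l(x) = i + eps used in the theorem."""
    return lambda x: i + eps


def max_error(keys, i, eps):
    """Largest |l(k_j) - j| over the chunk j = i, ..., i + 2*eps."""
    ell = horizontal_segment(i, eps)
    return max(abs(ell(keys[j]) - j) for j in range(i, i + 2 * eps + 1))


# Illustrative data (an assumption): 1000 distinct sorted integer keys.
random.seed(0)
keys = sorted(random.sample(range(10**6), 1000))
eps = 4

# Slide the chunk over every valid starting position i.
worst = max(max_error(keys, i, eps) for i in range(len(keys) - 2 * eps))
print(worst <= eps)
```

Because the segment is constant, the error depends only on the positions $j$ and never on the key values, so the worst error equals $\epsilon$ exactly and is attained at both ends of the chunk.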

Figures (14)

Figure 1: (a) A conventional B+-tree index. (b) A learned index with a "last-mile" maximum search error $\epsilon$.
Figure 2: A toy example of a 3-level PGM-Index with $\epsilon_i=1$ (i.e., internal search error range) and $\epsilon_\ell=4$ (i.e., last-mile search error range). Processing a lookup query on such PGM-Index involves in total three linear function evaluations, two internal search operations in the range $2\cdot\epsilon_i+1$, and one "last-mile" search operation on the sorted data array in the range $2\cdot\epsilon_\ell+1$.
Figure 3: Gap distributions for 4 real datasets. In the box plots, the horizontal lines and the star marks refer to the medians and means of data, respectively.
Figure 4: Illustration of gaps for the next level. Suppose $G$ is the segment coverage for the current level. The new gap in the $i$-th level is $g^{(i)}=\sum_{j'=j+1}^{j+G-1}g^{(i-1)}_{j'}$ where $g^{(i-1)}_{j'}$ is the $j'$-th gap in the $(i-1)$-th level.
Figure 5: Illustration of the CPU cycles used for searching an $(\epsilon_i,\epsilon_\ell)$-PGM-Index with a standard binary search algorithm for the internal error-bounded search operation.
...and 9 more figures
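The lookup procedure described in the caption of Figure 2 (one linear function evaluation per level, an internal search bounded by $\epsilon_i$, and a last-mile search bounded by $\epsilon_\ell$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all names (`Segment`, `build_segments`, `build_pgm`, `lookup`) are hypothetical, keys are assumed to be distinct sorted integers, and a simple greedy "shrinking cone" segmentation is swapped in for brevity, so the segments are error-bounded but not necessarily the ones a PGM-Index would build. The search windows are padded by two slots beyond $\epsilon$ to absorb floating-point rounding and queries that fall between indexed keys; that padding is also an assumption of this sketch.

```python
import random
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class Segment(NamedTuple):
    first_key: int
    slope: float
    start: int  # first covered position in the level below
    end: int    # one past the last covered position

    def predict(self, q):
        # Linear function evaluation, clamped to the covered range.
        pos = self.start + int(self.slope * (q - self.first_key))
        return min(max(pos, self.start), self.end - 1)


def build_segments(keys, eps):
    """Greedy shrinking-cone eps-PLA (hypothetical stand-in for the optimal one)."""
    segs, start, n = [], 0, len(keys)
    while start < n:
        k0, lo, hi, end = keys[start], 0.0, float("inf"), start + 1
        while end < n:
            dx, dy = keys[end] - k0, end - start
            new_lo, new_hi = max(lo, (dy - eps) / dx), min(hi, (dy + eps) / dx)
            if new_lo > new_hi:  # no slope fits all points within eps
                break
            lo, hi, end = new_lo, new_hi, end + 1
        slope = lo if hi == float("inf") else (lo + hi) / 2
        segs.append(Segment(k0, slope, start, end))
        start = end
    return segs


def build_pgm(keys, eps_i, eps_l):
    """Bottom level uses eps_l; every level above indexes first keys with eps_i."""
    levels = [build_segments(keys, eps_l)]
    while len(levels[-1]) > 1:
        firsts = [s.first_key for s in levels[-1]]
        levels.append(build_segments(firsts, eps_i))
    firsts = [[s.first_key for s in lvl] for lvl in levels]
    return levels, firsts


def lookup(index, keys, q, eps_i, eps_l):
    """Return the position of q in keys, or None if q is absent."""
    levels, firsts = index
    if q < keys[0]:
        return None
    seg = levels[-1][0]  # root segment
    for depth in range(len(levels) - 1, 0, -1):
        pos = seg.predict(q)
        lo = max(seg.start, pos - eps_i - 2)
        hi = min(seg.end, pos + eps_i + 2)
        # Internal error-bounded search: rightmost segment with first_key <= q.
        seg = levels[depth - 1][bisect_right(firsts[depth - 1], q, lo, hi) - 1]
    pos = seg.predict(q)
    lo, hi = max(seg.start, pos - eps_l - 2), min(seg.end, pos + eps_l + 2)
    j = bisect_left(keys, q, lo, hi)  # last-mile search on the data array
    return j if j < hi and keys[j] == q else None


# Illustrative usage on random data (an assumption, not a benchmark).
random.seed(1)
keys = sorted(random.sample(range(10**7), 5000))
index = build_pgm(keys, eps_i=4, eps_l=32)
print(lookup(index, keys, keys[1234], 4, 32))
```

Both search steps here use binary search within the bounded window; other search routines could be substituted at either step without changing the surrounding structure.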

Theorems & Definitions (8)

definition 1: $\epsilon$-PLA
definition 2: $(\epsilon_i, \epsilon_\ell)$-PGM-Index [DBLP:journals/pvldb/FerraginaV20]
theorem 1: PGM-Index Lower Bound [DBLP:journals/pvldb/FerraginaV20]
theorem 2: Expected Line Segment Coverage [DBLP:conf/icml/FerraginaLV20]
lemma 1: Expected Coverage Recursion
lemma 2: Expected Coverage of Level-$i$
theorem 3: PGM-Index Height
theorem 4: Space and Time Complexity
