Why Are Learned Indexes So Effective but Sometimes Ineffective?

Qiyu Liu, Siyuan Han, Yanlin Qi, Jingshu Peng, Jin Li, Longlong Lin, Lei Chen

TL;DR

This work reexamines the PGM-Index, a learned index based on error-bounded piecewise linear approximation. It proves a sub-logarithmic lookup bound of $O(\log\log N)$ under linear space $O(N/G)$, where $G$ is the segment coverage, and it diagnoses why the index often underperforms in practice: lookups are memory-bound, and the internal error-bounded searches become the bottleneck. It introduces PGM++, a simple enhancement that combines a hybrid search strategy with automated parameter tuning to optimize the space-time trade-off. Across real and synthetic workloads, PGM++ achieves speedups of up to $2.31\times$ over the original PGM-Index and up to $1.56\times$ over state-of-the-art learned indexes at comparable space, establishing a new Pareto frontier. The paper also provides a workload-independent cost model and tuning methodology that improve robustness and applicability in practical DBMS settings.

Abstract

Learned indexes have attracted significant research interest due to their ability to offer better space-time trade-offs compared to traditional B+-tree variants. Among various learned indexes, the PGM-Index, based on error-bounded piecewise linear approximation, is an elegant data structure that has demonstrated provably superior performance over conventional B+-tree indexes. In this paper, we explore two interesting research questions regarding the PGM-Index: (a) Why are PGM-Indexes theoretically effective? and (b) Why do PGM-Indexes underperform in practice? For question (a), we first prove that, for a set of $N$ sorted keys, the PGM-Index can, with high probability, achieve a lookup time of $O(\log\log N)$ while using $O(N)$ space. To the best of our knowledge, this is the tightest bound for learned indexes to date. For question (b), we identify that querying PGM-Indexes is highly memory-bound, where the internal error-bounded search operations often become the bottleneck. To fill the performance gap, we propose PGM++, a simple yet effective extension to the original PGM-Index that employs a mixture of different search strategies, with hyper-parameters automatically tuned through a calibrated cost model. Extensive experiments on real workloads demonstrate that PGM++ establishes a new Pareto frontier. At comparable space costs, PGM++ speeds up index lookup queries by up to $2.31\times$ and $1.56\times$ when compared to the original PGM-Index and state-of-the-art learned indexes, respectively.

Paper Structure

This paper contains 18 sections, 6 theorems, 15 equations, 14 figures, and 7 tables.

Key Result

theorem 1

Given a consecutive chunk of $2\epsilon+1$ sorted keys $\{k_i,\cdots,k_{i+2\epsilon}\}\subseteq\mathcal{K}$, there exists a horizontal line segment $\ell(x)=i+\epsilon$ such that $|\ell(k_j)-j|\leq\epsilon$ holds for $j=i,\cdots,i+2\epsilon$, implying that each line segment in an $\epsilon$-PLA can
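The bound in the statement above can be checked numerically. The following is a minimal sketch, not code from the paper: the function name `horizontal_segment_error` and the sample key set are made up for illustration. It evaluates the horizontal segment $\ell(x)=i+\epsilon$ on every chunk of $2\epsilon+1$ consecutive keys and reports the largest deviation $|\ell(k_j)-j|$.

```python
def horizontal_segment_error(keys, i, eps):
    """Return max |l(k_j) - j| for j = i..i+2*eps, with l(x) = i + eps."""
    def ell(x):
        # Horizontal segment: the predicted position ignores the key value.
        return i + eps

    return max(abs(ell(keys[j]) - j) for j in range(i, i + 2 * eps + 1))


# Hypothetical sorted key set; any sorted keys give the same result.
keys = [3, 8, 9, 15, 22, 40, 41, 57, 90, 120, 121, 300]
eps = 2

# One entry per chunk start i such that the chunk fits inside `keys`.
errors = [horizontal_segment_error(keys, i, eps)
          for i in range(len(keys) - 2 * eps)]
print(errors)
```

Because the segment is constant, the error depends only on the positions, so the worst case is reached at the two ends of the chunk and equals $\epsilon$ regardless of how the keys are distributed.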

Figures (14)

  • Figure 1: (a) A conventional B+-tree index. (b) A learned index with a "last-mile" maximum search error $\epsilon$.
  • Figure 2: A toy example of a 3-level PGM-Index with $\epsilon_i=1$ (i.e., internal search error range) and $\epsilon_\ell=4$ (i.e., last-mile search error range). Processing a lookup query on such PGM-Index involves in total three linear function evaluations, two internal search operations in the range $2\cdot\epsilon_i+1$, and one "last-mile" search operation on the sorted data array in the range $2\cdot\epsilon_\ell+1$.
  • Figure 3: Gap distributions for 4 real datasets. In the box plots, the horizontal lines and the star marks refer to the medians and means of data, respectively.
  • Figure 4: Illustration of gaps for the next level. Suppose $G$ is the segment coverage for the current level. The new gap in the $i$-th level is $g^{(i)}=\sum_{j'=j+1}^{j+G-1}g^{(i-1)}_{j'}$ where $g^{(i-1)}_{j'}$ is the $j'$-th gap in the $(i-1)$-th level.
  • Figure 5: Illustration of the CPU cycles used for searching an $(\epsilon_i,\epsilon_\ell)$-PGM-Index with a standard binary search algorithm for the internal error-bounded search operation.
  • ...and 9 more figures
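The lookup steps described in the Figure 2 caption (evaluate a linear function, then search a window of $2\epsilon+1$ positions) can be sketched for a single level as follows. This is an illustrative simplification under stated assumptions, not the PGM-Index construction: it fits one line through the first and last key instead of building an optimal $\epsilon$-PLA, it measures $\epsilon$ after the fact, and it assumes at least two distinct sorted keys. The names `fit_single_segment`, `predict`, and `lookup` are hypothetical.

```python
from bisect import bisect_left


def predict(key, slope, intercept, n):
    """Evaluate the linear function and clamp the result to a valid position."""
    return min(n - 1, max(0, round(slope * key + intercept)))


def fit_single_segment(keys):
    """Fit one line through the first and last key (an assumption made for
    brevity; a real PGM-Index uses an optimal error-bounded PLA).
    Returns (slope, intercept, eps), where eps is the measured max error."""
    n = len(keys)
    slope = (n - 1) / (keys[-1] - keys[0])
    intercept = -slope * keys[0]
    eps = max(abs(predict(k, slope, intercept, n) - i)
              for i, k in enumerate(keys))
    return slope, intercept, eps


def lookup(keys, key, slope, intercept, eps):
    """Error-bounded search: binary search only inside [p - eps, p + eps]."""
    n = len(keys)
    p = predict(key, slope, intercept, n)
    lo, hi = max(0, p - eps), min(n, p + eps + 1)
    pos = bisect_left(keys, key, lo, hi)
    return pos if pos < n and keys[pos] == key else -1


# Hypothetical data set of distinct sorted keys.
keys = [2, 4, 7, 9, 13, 16, 18, 21, 25, 30]
slope, intercept, eps = fit_single_segment(keys)
found = [lookup(keys, k, slope, intercept, eps) for k in keys]
```

In a multi-level index the same two steps repeat at each internal level with error $\epsilon_i$, followed by one last-mile search with error $\epsilon_\ell$ on the sorted data array.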

Theorems & Definitions (8)

  • definition 1: $\epsilon$-PLA
  • definition 2: $(\epsilon_i, \epsilon_\ell)$-PGM-Index (Ferragina and Vinciguerra, PVLDB 2020)
  • theorem 1: PGM-Index Lower Bound (Ferragina and Vinciguerra, PVLDB 2020)
  • theorem 2: Expected Line Segment Coverage (Ferragina, Lillo, and Vinciguerra, ICML 2020)
  • lemma 1: Expected Coverage Recursion
  • lemma 2: Expected Coverage of Level-$i$
  • theorem 3: PGM-Index Height
  • theorem 4: Space and Time Complexity