Why Are Learned Indexes So Effective but Sometimes Ineffective?
Qiyu Liu, Siyuan Han, Yanlin Qi, Jingshu Peng, Jin Li, Longlong Lin, Lei Chen
TL;DR
This work reexamines the PGM-Index, a learned index based on error-bounded piecewise linear approximation. It proves that, with high probability, the PGM-Index achieves a sub-logarithmic lookup time of $O(\log\log N)$ in linear space $O(N)$, and it diagnoses why the index often underperforms in practice: queries are memory-bound, and the internal error-bounded searches become the bottleneck. It introduces PGM++, a simple and effective extension that uses a hybrid search strategy and automated parameter tuning to optimize space-time trade-offs. Across real and synthetic workloads, PGM++ achieves speedups of up to $2.31\times$ over the original PGM-Index and up to $1.56\times$ over state-of-the-art learned indexes at comparable space, establishing a new Pareto frontier. The paper also provides a workload-independent cost model and tuning methodology that improve robustness and applicability in practical DBMS settings.
Abstract
Learned indexes have attracted significant research interest due to their ability to offer better space-time trade-offs than traditional B+-tree variants. Among the various learned indexes, the PGM-Index, which is based on error-bounded piecewise linear approximation, is an elegant data structure that has demonstrated \emph{provably} superior performance over conventional B+-tree indexes. In this paper, we explore two research questions regarding the PGM-Index: (a) \emph{Why are PGM-Indexes theoretically effective?} and (b) \emph{Why do PGM-Indexes underperform in practice?} For question~(a), we prove that, for a set of $N$ sorted keys, the PGM-Index can, with high probability, achieve a lookup time of $O(\log\log N)$ while using $O(N)$ space. To the best of our knowledge, this is the \textbf{tightest bound} for learned indexes to date. For question~(b), we identify that querying PGM-Indexes is highly memory-bound, and that the internal error-bounded search operations often become the bottleneck. To close this performance gap, we propose PGM++, a \emph{simple yet effective} extension to the original PGM-Index that employs a mixture of different search strategies, with hyper-parameters automatically tuned through a calibrated cost model. Extensive experiments on real workloads demonstrate that PGM++ establishes a new Pareto frontier. At comparable space costs, PGM++ speeds up index lookup queries by up to $\mathbf{2.31\times}$ and $\mathbf{1.56\times}$ compared to the original PGM-Index and state-of-the-art learned indexes, respectively.
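To make the idea of an error-bounded piecewise linear approximation concrete, here is a minimal single-level sketch in Python. It is an illustration under stated assumptions, not the PGM-Index or PGM++ implementation: the function names `build_segments` and `lookup` are hypothetical, the keys are assumed to be distinct and sorted, and the segments are fitted with a simple greedy "shrinking cone" anchored at each segment's first key rather than with the optimal fitting algorithm a real PGM-Index would use. The final bounded binary search stands in for the internal error-bounded search that the abstract identifies as the bottleneck.

```python
from bisect import bisect_left, bisect_right
import math


def build_segments(keys, eps):
    """Greedy fit over distinct sorted keys (assumption of this sketch).

    Returns (first_key, first_rank, slope) triples. Each segment predicts the
    rank of every key it covers to within +/- eps.
    """
    segments = []
    k0, r0 = keys[0], 0
    lo, hi = 0.0, math.inf  # feasible slope interval (the "cone")
    for r in range(1, len(keys)):
        dx = keys[r] - k0
        # Slopes that keep the prediction for keys[r] inside [r - eps, r + eps].
        new_lo = max(lo, (r - eps - r0) / dx)
        new_hi = min(hi, (r + eps - r0) / dx)
        if new_lo > new_hi:
            # Cone is empty: close the current segment and start a new one here.
            segments.append((k0, r0, (lo + hi) / 2))
            k0, r0 = keys[r], r
            lo, hi = 0.0, math.inf
        else:
            lo, hi = new_lo, new_hi
    segments.append((k0, r0, lo if math.isinf(hi) else (lo + hi) / 2))
    return segments


def lookup(keys, segments, starts, eps, q):
    """Return the rank of q in keys, or None if q is absent."""
    n = len(keys)
    # 1. Find the segment whose first key is the largest one <= q.
    j = max(0, bisect_right(starts, q) - 1)
    k0, r0, slope = segments[j]
    # 2. Predict the rank with the segment's linear model.
    pos = int(r0 + slope * (q - k0))
    # 3. Error-bounded search: only a window of about 2*eps ranks is inspected
    #    (one extra slot on each side absorbs truncation and rounding).
    lo = min(n, max(0, pos - eps - 1))
    hi = max(lo, min(n, pos + eps + 2))
    i = bisect_left(keys, q, lo, hi)
    return i if i < n and keys[i] == q else None
```

In this sketch the segment is located with a binary search over the segment start keys; the PGM-Index instead indexes those start keys recursively with further error-bounded levels. A larger `eps` yields fewer segments (less space) but a wider final search window, which is the space-time trade-off that the paper's tuning targets.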
