Querying in Constant Expected Time with Learned Indexes

Luis Croquevielle; Guang Yang; Liang Liang; Ali Hadian; Thomas Heinis

TL;DR

It is proved that $O(1)$ expected time can be achieved with at most linear space, thereby establishing the tightest upper bound so far for the time complexity of an asymptotically optimal learned index.

Abstract

Learned indexes leverage machine learning models to accelerate query answering in databases, showing impressive practical performance. However, theoretical understanding of these methods remains incomplete. Existing research suggests that learned indexes have superior asymptotic complexity compared to their non-learned counterparts, but these findings have been established under restrictive probabilistic assumptions. Specifically, for a sorted array with $n$ elements, it has been shown that learned indexes can find a key in $O(\log(\log n))$ expected time using at most linear space, compared with $O(\log n)$ for non-learned methods. In this work, we prove $O(1)$ expected time can be achieved with at most linear space, thereby establishing the tightest upper bound so far for the time complexity of an asymptotically optimal learned index. Notably, we use weaker probabilistic assumptions than prior research, meaning our work generalizes previous results. Furthermore, we introduce a new measure of statistical complexity for data. This metric exhibits an information-theoretical interpretation and can be estimated in practice. This characterization provides further theoretical understanding of learned indexes, by helping to explain why some datasets seem to be particularly challenging for these methods.
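To make the idea of a learned index over a sorted array concrete, the following is a minimal sketch of a piecewise constant rank approximator over equal-length key intervals, in the spirit of the Equal-Split Piecewise Constant index named in the outline. It is an illustration under stated assumptions, not the paper's construction: the class name `EqualSplitIndex`, the method `lookup`, and the parameter `num_intervals` are hypothetical, and the corrective step uses a plain binary search inside the predicted interval as a stand-in for whatever search the paper analyzes.

```python
import bisect


class EqualSplitIndex:
    """Hypothetical sketch: rank lookup via K equal-length key intervals."""

    def __init__(self, keys, num_intervals):
        # Assumption: `keys` is a non-empty list sorted in ascending order.
        self.keys = keys
        self.lo = keys[0]
        self.hi = keys[-1]
        self.k = num_intervals
        self.width = (self.hi - self.lo) / num_intervals
        # starts[j] = number of keys falling in intervals 0 .. j-1,
        # so interval j owns positions starts[j] .. starts[j + 1] - 1.
        counts = [0] * (self.k + 1)
        for x in keys:
            counts[self._interval(x) + 1] += 1
        for j in range(1, self.k + 1):
            counts[j] += counts[j - 1]
        self.starts = counts

    def _interval(self, x):
        # Map a key to its interval number, clamped to [0, k - 1].
        if self.width == 0:
            return 0  # all keys are equal
        j = int((x - self.lo) / self.width)
        return min(max(j, 0), self.k - 1)

    def lookup(self, q):
        # Predictive step: constant-time arithmetic picks one interval.
        j = self._interval(q)
        # Corrective step: search only among the keys of that interval.
        return bisect.bisect_left(self.keys, q, self.starts[j], self.starts[j + 1])


keys = sorted(((i * i * 7919) % 10007) / 10.0 for i in range(500))
index = EqualSplitIndex(keys, num_intervals=500)
position = index.lookup(keys[123])
```

With `num_intervals` proportional to the number of keys, the auxiliary array `starts` takes linear space, and the cost of a lookup depends on how many keys share an interval with the query, which is where the distribution of the data enters.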

Paper Structure

This paper contains 35 sections, 9 theorems, 23 equations, 5 figures, 2 tables, 2 algorithms.

Introduction
Preliminaries
Data
Learning Problem
Model of Computation
Main Results and ESPC Index
Complexity Bounds
Equal-Split Piecewise Constant Index
Complexity Proofs and Analysis
Preliminary Results
Main Probabilistic Bound
Proof of Asymptotic Complexity
Distribution of query parameter
Unbounded support
Analysis and Benchmarking
...and 20 more sections

Key Result

Theorem 1

Suppose $f$ has support $[a, b]$ and $\rho_f < \infty$. Define $\rho=\log((b-a)\rho_f)$. Then, there is a procedure $R_n$ for building learned indexes such that $\overline{S}_n=O(n)$ and $\overline{T}_n=O(\rho)$. That is, for an array $A$ with $n$ keys, an index can be built with space overhead $O

Figures (5)

Figure 1: Predictive and corrective steps of a learned index.
Figure 2: Partition of key range into four equal-length intervals, with approximator of $\mathbf{rank}$.
Figure 3: Equal-length partitions for $[X_{(1)}, X_{(n)}]$ (given by $\{I_k\}$) and $[a, b]$ (given by $\{J_k\}$).
Figure 4: Average experimental error and theoretical bound for expected error, as functions of the number $K$ of subintervals. Top row shows synthetic datasets, bottom row shows real-world datasets. The plots use a logarithmic scale for both axes, which prevents clustering of the points.
Figure 5: $\hat{\rho}_f$ estimate for normal, amzn and osm datasets as number of samples $n$ increases. The estimation process uses the histogram method to approximate the density $f$.
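The histogram-based estimate mentioned for Figure 5 can be sketched as follows. This summary does not define $\rho_f$, so the code assumes $\rho_f = \int f(x)^2\,dx$, which is consistent with Theorem 1 in that a uniform density on $[a, b]$ then gives $\rho = \log((b-a)\rho_f) = 0$. The function name `estimate_rho`, the parameter `num_bins`, the use of the sample range as a stand-in for $[a, b]$, and the natural logarithm are all assumptions made for illustration.

```python
import math


def estimate_rho(samples, num_bins):
    """Hypothetical sketch: histogram estimate of rho_f and rho.

    Assumes rho_f is the integral of the squared density and that the
    samples contain at least two distinct values.
    """
    lo, hi = min(samples), max(samples)
    n = len(samples)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in samples:
        # The maximum sample lands exactly on the right edge, so clamp it.
        j = min(int((x - lo) / width), num_bins - 1)
        counts[j] += 1
    # Histogram density on bin j is counts[j] / (n * width); integrating
    # its square over all bins gives sum(counts[j]^2) / (n^2 * width).
    rho_f_hat = sum(c * c for c in counts) / (n * n * width)
    rho_hat = math.log((hi - lo) * rho_f_hat)
    return rho_f_hat, rho_hat


uniform_like = [i + 0.5 for i in range(1000)]
skewed = [0.0] * 900 + [float(i) for i in range(100)]
_, rho_uniform = estimate_rho(uniform_like, num_bins=10)
_, rho_skewed = estimate_rho(skewed, num_bins=10)
```

Under this assumed definition, evenly spread samples give an estimate of $\rho$ near zero, while samples concentrated in a few bins give a larger value, matching the abstract's remark that the measure helps explain why some datasets are harder for learned indexes.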

Theorems & Definitions (9)

Theorem 1
Theorem 2
Proposition 3
Lemma 4
Theorem 5
Proposition 6
Proposition 7
Theorem 8
Theorem 9
