Learned Indexes with Distribution Smoothing via Virtual Points

Kasun Amarasinghe, Farhana Choudhury, Jianzhong Qi, James Bailey

TL;DR

This paper treats learned indexes as CDF approximators and addresses the performance bottleneck caused by hard-to-learn regions by smoothing the key distribution rather than altering the index structure. It introduces CSV, an algorithm that inserts virtual points to transform the CDF and integrates this smoothing into existing hierarchical indexes, balancing traversal and leaf-search costs via a cost model. The smoothing problem is formalized for both the single-linear-model and hierarchical-index settings and shown to be NP-hard to solve exactly, so it is paired with greedy approximations that achieve near-optimal loss reduction with linear-time complexity per smoothing step. Empirical results on four real datasets show query-time improvements of up to 34% and substantial promotion of keys to upper levels of the index, at a modest storage overhead, which makes the approach relevant to large-scale, read-heavy workloads.

Abstract

Recent research on learned indexes has created a new perspective for indexes as models that map keys to their respective storage locations. These learned indexes are created to approximate the cumulative distribution function of the key set, where using only a single model may have limited accuracy. To overcome this limitation, a typical method is to use multiple models, arranged in a hierarchical manner, where the query performance depends on two aspects: (i) traversal time to find the correct model and (ii) search time to find the key in the selected model. Such a method may cause some key space regions that are difficult to model to be placed at deeper levels in the hierarchy. To address this issue, we propose an alternative method that modifies the key space as opposed to any structural or model modifications. This is achieved through making the key set more learnable (i.e., smoothing the distribution) by inserting virtual points. Furthermore, we develop an algorithm named CSV to integrate our virtual point insertion method into existing learned indexes, reducing both their traversal and search time. We implement CSV on state-of-the-art learned indexes and evaluate them on real-world datasets. Extensive experimental results show significant query performance improvement for the keys in deeper levels of the index structures at a low storage cost.
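
To make the idea of smoothing by virtual points concrete, the following is a minimal sketch and not the paper's CSV algorithm. It assumes integer keys, a single least-squares linear model mapping each key to its position in the sorted array, and a loss (SSE) measured over all stored points, real and virtual. The helper names `fit_line`, `sse`, and `best_virtual_point` are hypothetical, and the brute-force scan over candidate positions is a stand-in for whatever search strategy the paper actually uses.

```python
def fit_line(keys, ranks):
    """Least-squares fit of rank ~ a * key + b; returns (a, b)."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_r = sum(ranks) / n
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (r - mean_r) for k, r in zip(keys, ranks))
    a = cov / var if var else 0.0
    return a, mean_r - a * mean_k


def sse(keys):
    """Sum of squared errors of a linear fit to the key -> position mapping.

    `keys` must be sorted; position i is the rank of keys[i].
    """
    ranks = range(len(keys))
    a, b = fit_line(keys, ranks)
    return sum((a * k + b - r) ** 2 for k, r in zip(keys, ranks))


def best_virtual_point(keys):
    """Brute-force search for one virtual key that most reduces the SSE.

    Tries every unused integer strictly between the smallest and largest
    key. Returns (loss, position); position is None if no candidate
    improves on the unsmoothed loss.
    """
    present = set(keys)
    best_loss, best_pos = sse(keys), None
    for cand in range(keys[0] + 1, keys[-1]):
        if cand in present:
            continue
        loss = sse(sorted(keys + [cand]))
        if loss < best_loss:
            best_loss, best_pos = loss, cand
    return best_loss, best_pos


# A dense run of keys followed by a large gap is hard for one line to fit.
keys = [0, 1, 2, 3, 10]
loss_before = sse(keys)
loss_after, position = best_virtual_point(keys)
print(loss_before, loss_after, position)
```

In this toy key set the only free slots lie in the gap between 3 and 10, so any improving virtual point lands there and pulls the key-to-position mapping closer to a straight line. Repeating the step under a budget on the number of virtual points gives a simple greedy smoothing loop, at the cost of the extra slots the virtual points occupy.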

Paper Structure

This paper contains 20 sections, 1 theorem, 21 equations, 10 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

Learned index CDF smoothing is NP-hard.

Figures (10)

  • Figure 1: Query time at each level of the LIPP index for four real datasets, each with 200 million keys.
  • Figure 2: Indexing data points (keys) with CDF smoothing.
  • Figure 3: Loss function (SSE) value corresponding to different insertion positions for a virtual point.
  • Figure 4: First-order partial derivatives of the loss (Equation \ref{eq:loss_derv}) with respect to the key value of a virtual point $k_v$.
  • Figure 5: CDFs of the datasets.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Definition 1
  • Lemma 1