Learning-Augmented Search Data Structures

Chunkai Fu, Brandon G. Nguyen, Jung Hoon Seo, Ryan Zesch, Samson Zhou

TL;DR

This work investigates learning-augmented search data structures that integrate ML-provided query advice into skip lists and KD trees. By modeling predicted per-item query frequencies $p_i$ and ground-truth frequencies $f_i$, the authors derive provable guarantees: with a perfect oracle the expected search time is at most $20 + 2\sum_{i=1}^n f_i \cdot \min\left(\log\frac{1}{p_i}, \log n\right)$, complemented by an information-theoretic lower bound of $H(f)$, and the construction remains robust to incorrect predictions. The proposed skip lists promote items to higher levels based on $p_i$, achieving near-optimal performance and constant-factor robustness under noisy predictions. Similarly, the KD-tree construction balances splits using the learned distribution and truncates low-probability queries to maintain efficiency, with a lower bound tied to $H(f)$. Empirical evaluations on Zipfian synthetic data, CAIDA/AOL traces, point clouds, n-grams, and neural activation data show improved query times and lookup depths over the classic structures, supporting both the theoretical guarantees and the practical viability of the approach for skewed and high-dimensional datasets.
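To make the promotion idea concrete, here is a minimal sketch of a skip list whose node heights depend on the predicted frequencies. The promotion rule is an assumption chosen for illustration, not necessarily the paper's exact construction: each item receives the usual coin-flip height, but never less than `max_height - ceil(log2(1/p_i))`. The names `SkipList`, `advised_height`, and `build` are hypothetical, and keys are assumed to be distinct.

```python
import math
import random


class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[l] is the successor on level l


class SkipList:
    def __init__(self, max_height):
        self.head = Node(None, max_height)  # sentinel; its key is never compared

    def insert(self, key, height):
        """Insert a distinct key occupying levels 0..height-1."""
        height = min(height, len(self.head.next))
        node = Node(key, height)
        cur = self.head
        for lvl in range(len(self.head.next) - 1, -1, -1):
            while cur.next[lvl] is not None and cur.next[lvl].key < key:
                cur = cur.next[lvl]
            if lvl < height:
                node.next[lvl] = cur.next[lvl]
                cur.next[lvl] = node

    def search(self, key):
        """Return (found, steps), counting forward moves and level drops."""
        cur, steps = self.head, 0
        for lvl in range(len(self.head.next) - 1, -1, -1):
            while cur.next[lvl] is not None and cur.next[lvl].key < key:
                cur = cur.next[lvl]
                steps += 1
            nxt = cur.next[lvl]
            if nxt is not None and nxt.key == key:
                return True, steps + 1
            steps += 1  # drop down one level
        return False, steps


def random_height(rng, max_height):
    """Classic skip-list height: promote with probability 1/2 per level."""
    h = 1
    while h < max_height and rng.random() < 0.5:
        h += 1
    return h


def advised_height(p_i, max_height):
    """Assumed rule: items with larger predicted frequency sit nearer the top."""
    if p_i <= 0:
        return 1
    guaranteed = max_height - math.ceil(math.log2(1.0 / p_i))
    return max(1, min(max_height, guaranteed))


def build(keys, p, seed=0):
    """Build a skip list over distinct keys with predicted frequencies p."""
    rng = random.Random(seed)
    max_height = max(1, math.ceil(math.log2(len(keys)))) + 1
    sl = SkipList(max_height)
    for key, p_i in zip(keys, p):
        h = max(random_height(rng, max_height), advised_height(p_i, max_height))
        sl.insert(key, h)
    return sl
```

Taking the maximum of the random and the advised height means an item with a poor prediction still gets at least its classic height. This is one simple way to picture the robustness property, under the assumptions stated above.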

Abstract

We study the integration of machine learning advice to improve upon traditional data structures designed for efficient search queries. Although there has been recent effort in improving the performance of binary search trees using machine learning advice, e.g., Lin et al. (ICML 2022), the resulting constructions nevertheless suffer from inherent weaknesses of binary search trees, such as the complexity of maintaining balance across multiple updates and the inability to handle partially-ordered or high-dimensional datasets. For these reasons, we focus on skip lists and KD trees in this work. Given access to a possibly erroneous oracle that outputs estimated fractional frequencies for search queries on a set of items, we construct skip lists and KD trees that provably provide the optimal expected search time, within nearly a factor of two. In fact, our learning-augmented skip lists and KD trees are still optimal up to a constant factor, even if the oracle is only accurate within a constant factor. We also demonstrate robustness by showing that our data structures achieve an expected search time that is within a constant factor of an oblivious skip list/KD tree construction even when the predictions are arbitrarily incorrect. Finally, we empirically show that our learning-augmented search data structures outperform their corresponding traditional analogs on both synthetic and real-world datasets.

Paper Structure (22 sections, 28 theorems, 16 equations, 11 figures, 4 tables, 3 algorithms)

Key Result

Theorem 2.1

For each $i\in[n]$, let $f_i$ and $p_i$ be the proportions of true and predicted queries to item $i$, respectively. Then with probability at least $0.99$ over the randomness of the construction of the skip list, the expected search time over the choice of queries is at most $20+2\sum_{i=1}^n f_i\cdot\min\left(\log\frac{1}{p_i},\log n\right)$.
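As a quick numerical illustration of this bound, the sketch below evaluates $20+2\sum_{i=1}^n f_i\cdot\min\left(\log\frac{1}{p_i},\log n\right)$ for given frequency vectors and compares it with the entropy of $f$. Two assumptions are made here that the statement above does not spell out: logarithms are taken base 2, and $H(f)$ denotes the Shannon entropy of the true query distribution. The function names are hypothetical.

```python
import math


def search_time_bound(f, p):
    """Evaluate 20 + 2 * sum_i f_i * min(log2(1/p_i), log2(n))."""
    n = len(f)
    cap = math.log2(n)
    total = 0.0
    for f_i, p_i in zip(f, p):
        if f_i == 0:
            continue  # items that are never queried contribute nothing
        # A zero prediction falls back to the log n cap.
        cost = cap if p_i <= 0 else min(math.log2(1.0 / p_i), cap)
        total += f_i * cost
    return 20 + 2 * total


def entropy(f):
    """Shannon entropy in bits (assumed meaning of H(f))."""
    return -sum(f_i * math.log2(f_i) for f_i in f if f_i > 0)


def zipf(n, alpha):
    """Normalized Zipfian frequencies proportional to 1 / i**alpha."""
    weights = [1.0 / (i ** alpha) for i in range(1, n + 1)]
    s = sum(weights)
    return [w / s for w in weights]


if __name__ == "__main__":
    f = zipf(1024, 1.37)
    print("bound with perfect predictions:", search_time_bound(f, f))
    print("bound with uniform predictions:", search_time_bound(f, [1 / 1024] * 1024))
    print("entropy H(f):", entropy(f))
```

With perfect predictions ($p_i = f_i$) the sum is at most $H(f)$ under these assumptions, so the bound is at most $20 + 2H(f)$. With uniform predictions every term hits the $\log n$ cap and the bound becomes $20 + 2\log n$.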

Figures (11)

  • Figure 1: Characterization of the CAIDA dataset distribution in subfigure (a). The nearly straight fitted curve in subfigure (b) implies that a Zipfian distribution with $\alpha=1.37$ is a good fit to the CAIDA dataset distribution.
  • Figure 2: Comparison of insertion and query time on CAIDA for classic and learning-augmented skip lists. This figure compares the insertion and query times under varying numbers of top frequently accessed unique IPs between classic and augmented implementations. The horizontal axis in the two subfigures depicts the same scheme of IP selection, represented in two different ways, e.g., the top 29.9 million queries contain 665210 unique IPs, the next 29.5 million queries comprise 296384 unique IPs, etc.
  • Figure 3: Robustness of our learning-augmented skip list to erroneous oracles. In the intersection-index subfigure, the axis labels indicate the timestamp at which the internet trace data was collected, e.g., 130100 means the collection starts at 13:01:00 and lasts for 1 minute.
  • Figure 4: Query time comparison for standard and learning-augmented KD trees with various noise.
  • Figure 5: Comparison of query time on learning-augmented KD trees with and without smooth spatial distribution across various Zipfian parameters.
  • ...and 6 more figures

Theorems & Definitions (43)

  • Theorem 2.1
  • Theorem 2.2
  • Lemma 2.2
  • Corollary 2.2
  • Lemma 2.2
  • Theorem 3.1
  • Theorem 3.2
  • Corollary 3.2
  • Lemma 3.2
  • Lemma B.1
  • ...and 33 more