Learning-Augmented Search Data Structures
Chunkai Fu, Brandon G. Nguyen, Jung Hoon Seo, Ryan Zesch, Samson Zhou
TL;DR
This work investigates learning-augmented search data structures that integrate ML-provided query advice into skip lists and KD trees. By modeling predicted per-item query frequencies $p_i$ and ground-truth frequencies $f_i$, the authors derive provable guarantees: with a perfect oracle, the expected search time is at most $20 + 2\sum_{i=1}^n f_i \min(\log(1/p_i), \log n)$, against an information-theoretic lower bound of $H(f)$, and the guarantees are robust to incorrect predictions. The proposed skip lists promote items to higher levels based on $p_i$, achieving near-optimal performance and constant-factor robustness under noisy predictions. Similarly, the KD-tree construction balances splits using the learned distribution and truncates low-probability queries to maintain efficiency, with a lower bound tied to $H(f)$. Empirical evaluations on Zipfian synthetic data, CAIDA/AOL traces, point clouds, n-grams, and neural activation data show substantial improvements in query times and lookup depths, supporting both the theoretical guarantees and the practical viability of the approach on skewed and high-dimensional datasets.
Abstract
We study the integration of machine learning advice to improve upon traditional data structures designed for efficient search queries. Although there has been recent effort in improving the performance of binary search trees using machine learning advice, e.g., Lin et al. (ICML 2022), the resulting constructions nevertheless suffer from inherent weaknesses of binary search trees, such as the complexity of maintaining balance across multiple updates and the inability to handle partially-ordered or high-dimensional datasets. For these reasons, we focus on skip lists and KD trees in this work. Given access to a possibly erroneous oracle that outputs estimated fractional frequencies for search queries on a set of items, we construct skip lists and KD trees that provably provide the optimal expected search time, within nearly a factor of two. In fact, our learning-augmented skip lists and KD trees are still optimal up to a constant factor, even if the oracle is only accurate within a constant factor. We also demonstrate robustness by showing that our data structures achieve an expected search time that is within a constant factor of an oblivious skip list/KD tree construction even when the predictions are arbitrarily incorrect. Finally, we empirically show that our learning-augmented search data structures outperform their corresponding traditional analogs on both synthetic and real-world datasets.
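As a rough numerical illustration of the quantities in the TL;DR, the Python sketch below evaluates the expression $20 + 2\sum_{i=1}^n f_i \min(\log(1/p_i), \log n)$ next to the entropy $H(f)$. It is not the authors' code. The helper names (`zipf_frequencies`, `entropy`, `search_time_bound`), the Zipfian exponent, and the choice of base-2 logarithms are all assumptions made for this sketch.

```python
import math


def zipf_frequencies(n, s=1.0):
    """Normalized Zipfian frequencies, f_i proportional to 1 / i**s.

    The exponent s is an illustrative assumption, not a value from the paper.
    """
    weights = [1.0 / (i ** s) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(f):
    """Shannon entropy H(f) in bits; zero-frequency items contribute nothing."""
    return -sum(x * math.log2(x) for x in f if x > 0)


def search_time_bound(f, p):
    """Evaluate 20 + 2 * sum_i f_i * min(log(1/p_i), log n).

    Base-2 logarithms are assumed. A prediction p_i <= 0 is treated as
    uninformative, so the per-item term falls back to the log n cap.
    """
    n = len(f)
    log_n = math.log2(n)
    total = 0.0
    for f_i, p_i in zip(f, p):
        cost = log_n if p_i <= 0 else min(math.log2(1.0 / p_i), log_n)
        total += f_i * cost
    return 20 + 2 * total


if __name__ == "__main__":
    n = 1000
    f = zipf_frequencies(n)
    uniform = [1.0 / n] * n  # an uninformative prediction
    print("H(f):", entropy(f))
    print("bound with p = f:", search_time_bound(f, f))
    print("bound with uniform p:", search_time_bound(f, uniform))
```

Because each term is capped at $\log n$, the expression never exceeds $20 + 2\log n$ regardless of how poor the predictions are, which mirrors the robustness claim. With $p = f$ it is at most $20 + 2H(f)$.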
