Learned Static Function Data Structures
Stefan Hermann, Hans-Peter Lehmann, Giorgio Vinciguerra, Stefan Walzer
TL;DR
The paper addresses space-efficient key-value storage when queries are restricted to a fixed key set, and introduces learned static functions (LSFs). An LSF combines a machine-learned model that predicts a per-key probability distribution over values with a key-specific prefix code whose codewords are stored in a variable-length static function. When the model captures correlations between keys and values, the space used can fall below the zero-order empirical entropy H0 of the value sequence. Key contributions include an extension of BuRR to variable-length outputs (VL-SF), a generalized filter trick that reduces overhead, and a weighted relative membership framework that enables per-key decision coding; calibrating the model tightens the space bounds further. Experiments show space savings of up to 10× on real data and up to 1000× on synthetic data with competitive query performance, supporting the practicality of LSFs in memory-constrained domains such as GIS, security, biology, and databases.
Abstract
We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.
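The encoding idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the names `toy_model`, `true_value`, `huffman_code`, `decode`, and `query` are hypothetical, Huffman coding stands in for whichever prefix code the paper uses, and a plain `dict` stands in for the variable-length static function. Unlike a real static function, the `dict` stores the keys, so only the codeword bit counts are meaningful here.

```python
import heapq
import math
from collections import Counter
from itertools import count


def huffman_code(probs):
    """Return {symbol: bitstring} for a Huffman prefix code over probs."""
    if len(probs) == 1:
        return {next(iter(probs)): "0"}
    tiebreak = count()  # unique counter so the dicts are never compared
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in sorted(probs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


def toy_model(key, num_values=4, confidence=0.9):
    """Hypothetical stand-in for a learned model: it guesses key % num_values."""
    guess = key % num_values
    rest = (1.0 - confidence) / (num_values - 1)
    return {v: (confidence if v == guess else rest) for v in range(num_values)}


def true_value(key):
    """Synthetic data: the model's guess is right except for every 10th key."""
    return (key + 1) % 4 if key % 10 == 0 else key % 4


def decode(bits, code):
    """Read bits until they form a codeword of the given prefix code."""
    inverse = {b: s for s, b in code.items()}
    prefix = ""
    for bit in bits:
        prefix += bit
        if prefix in inverse:
            return inverse[prefix]
    raise ValueError("bits do not start with a codeword")


def query(key, stored):
    """Rebuild the key-specific code from the model, then decode the stored bits."""
    return decode(stored[key], huffman_code(toy_model(key)))


keys = range(1000)
# ASSUMPTION: a dict stands in for the variable-length static function.
stored = {k: huffman_code(toy_model(k))[true_value(k)] for k in keys}

learned_bits = sum(len(bits) for bits in stored.values())
value_counts = Counter(true_value(k) for k in keys)
h0_bits = -sum(c * math.log2(c / len(keys)) for c in value_counts.values())

print(f"codeword bits: {learned_bits}, zero-order entropy bits: {h0_bits:.0f}")
```

Because the toy model assigns a one-bit codeword to its guess and the guess is correct for 90% of the keys, the total codeword length falls well below the zero-order entropy of the value sequence, which ignores the keys entirely. A query only needs the model and the stored bits, since the key-specific code is rebuilt from the model's prediction at query time.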
