Learned Static Function Data Structures

Stefan Hermann, Hans-Peter Lehmann, Giorgio Vinciguerra, Stefan Walzer

TL;DR

The paper tackles space-efficient key-value storage when queries are restricted to a fixed key set by introducing learned static functions (LSFs). An LSF combines a machine-learned model that outputs a per-key distribution over values with a per-key prefix code whose codewords are stored in a variable-length static function. When the correlations between keys and values are learnable, this brings the space below the zero-order entropy H0. Key innovations include an extension of BuRR to variable-length outputs (VL-SF), a generalized filter trick that reduces overhead, and a weighted relative membership framework that enables per-key decision coding; calibrating the model further tightens the space bounds. Experiments on real and synthetic datasets demonstrate substantial space savings (up to 10× on real data and up to 1000× on synthetic data) with competitive query performance, validating the practicality of LSFs for memory-constrained domains such as GIS, security, biology, and data-intensive databases.

Abstract

We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.
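
The pipeline described in the abstract (model, key-specific prefix code, stored codeword) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the data (`true_value`), the hand-written `model`, the use of Huffman coding as the prefix code, and a plain `dict` standing in for the static function. A real static function would not store the keys, and its own space overhead is not modeled here; only the codeword payload is counted.

```python
import heapq
import math
from collections import Counter

VALUES = ["A", "B", "C", "D"]

def true_value(key):
    # Hypothetical data: the value follows key % 4, except for every 10th key.
    shift = 1 if key % 10 == 0 else 0
    return VALUES[(key + shift) % 4]

def model(key):
    # Hypothetical stand-in for a trained model: a distribution over values.
    guess = VALUES[key % 4]
    return {v: (0.85 if v == guess else 0.05) for v in VALUES}

def huffman_code(dist):
    """Build a prefix code {value: bitstring} from {value: probability}."""
    # Heap entries are (probability, tie_breaker, {value: partial codeword});
    # the unique tie_breaker ensures the dicts are never compared.
    heap = [(p, i, {v: ""}) for i, (v, p) in enumerate(sorted(dist.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {v: "0" + c for v, c in c0.items()}
        merged.update({v: "1" + c for v, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

def build(keys):
    # Encode each key's true value with the code derived from its own prediction.
    codewords = {k: huffman_code(model(k))[true_value(k)] for k in keys}
    l_max = max(len(cw) for cw in codewords.values())
    # Stand-in for the static function: bits after the codeword are arbitrary
    # padding, since the prefix code tells the query where the codeword ends.
    sf = {k: cw.ljust(l_max, "0") for k, cw in codewords.items()}
    payload_bits = sum(len(cw) for cw in codewords.values())
    return sf, payload_bits

def query(sf, key):
    # Re-run the model, rebuild the same code, and decode the stored bits.
    bits = sf[key]
    for value, cw in huffman_code(model(key)).items():
        if bits.startswith(cw):
            return value
    raise ValueError("stored bits do not match any codeword")

def zero_order_entropy_bits(values):
    # n * H0 of the value sequence, in bits.
    n = len(values)
    return sum(c * math.log2(n / c) for c in Counter(values).values())

keys = range(1000)
sf, payload_bits = build(keys)
h0_bits = zero_order_entropy_bits([true_value(k) for k in keys])
print(f"codeword payload: {payload_bits} bits, n*H0: {h0_bits:.0f} bits")
```

Because the toy model is right for nine out of ten keys, most codewords are a single bit, and the total payload falls below n times the zero-order entropy of the value sequence. With a model that ignores the key, the same construction would not be expected to beat that bound.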

Paper Structure

This paper contains 44 sections, 4 equations, 5 figures, 5 tables, 4 algorithms.

Figures (5)

  • Figure 1: Architecture of learned static functions. In gray, a simple implementation that stores Huffman codes in the auxiliary data structure $\mathcal{D}$.
  • Figure 2: Structure of the BuRR equation system [DillingerHSW2022burr].
  • Figure 3: A VL-BuRR data structure using $\ell_{\max}$ separate $1$-bit SFs, given by column vectors $Z_0, \ldots, Z_{\ell_{\max}-1}$ (here $\ell_{\max} = 6$). Shaded areas are two examples of the bits accessed by a single query. Gray dots indicate individual bits, dotted lines indicate how the bits are grouped into words of length $w$, and numbers indicate the order in which these words are stored.
  • Figure 4: Idealised space overhead of filter-based WRM for keys with weight $p \in [0, \frac{1}{2}]$. "Optimal filter" shows the overhead $(\text{space}(p, r^*_{\mathbb{R}_{\geq 0}}(p)) - H(p))/H(p)$ when using a weighted filter that supports real-valued weights. The maximum is $\approx 0.086$ at $p \approx 0.15$. "Optimal binary filter" shows the overhead $(\text{space}(p, r^*_{\mathbb{N}_0}(p)) - H(p))/H(p)$ when using a weighted filter that supports integer weights only. The maximum is $\approx 0.108$ at $p = 0.2$.
  • Figure 5: Query time, inference time (dashed), and space overhead on the gauss dataset for varying numbers of classes and varying $S(\mathcal{M}, f)$.