AutoCSF: Provably Space-Efficient Indexing of Skewed Key-Value Workloads via Filter-Augmented Compressed Static Functions

David Torres Ramos, Vihan Lakshman, Chen Luo, Todd Treangen, Benjamin Coleman

Abstract

We study the problem of building space-efficient, in-memory indexes for massive key-value datasets with highly skewed value distributions. This challenge arises in many data-intensive domains and is particularly acute in computational genomics, where $k$-mer count tables can contain billions of entries dominated by a single frequent value. While recent work has proposed to address this problem by augmenting compressed static functions (CSFs) with pre-filters, existing approaches rely on complex heuristics and lack formal guarantees. In this paper, we introduce a principled algorithm, called AutoCSF, for combining CSFs with pre-filtering to provably handle skewed distributions with near-optimal space usage. We improve upon prior CSF pre-filtering constructions by (1) deriving a mathematically rigorous decision criterion for when filter augmentation is beneficial; (2) presenting a general algorithmic framework for integrating CSFs with modern set membership data structures beyond the classic Bloom filter; and (3) establishing theoretical guarantees on the overall space usage of the resulting indexes. Our open-source implementation of AutoCSF demonstrates space savings over baseline methods while maintaining low query latency.

Paper Structure

This paper contains 17 sections, 4 theorems, 39 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 4.1

Given frequencies $F = \{f_1, f_2, \ldots, f_z\}$ of values $V = \{v_1, v_2, \ldots, v_z\}$ and a Huffman code $H(F)$ with lengths $\{l_1, l_2, \ldots, l_z\}$, for any other uniquely decodable code with lengths $l'_i$, we have $\sum_{i=1}^{z} f_i l_i \le \sum_{i=1}^{z} f_i l'_i$.
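Theorem 4.1 appears to be the classical optimality property of Huffman codes: no uniquely decodable code achieves a smaller frequency-weighted total length. The following minimal Python sketch illustrates that property on a skewed toy distribution. The function names (`huffman_lengths`, `cost`) and the example frequencies are assumptions chosen for illustration; they are not taken from the AutoCSF implementation.

```python
import heapq
from itertools import count


def huffman_lengths(freqs):
    """Return Huffman codeword lengths l_i for frequencies f_i (illustrative helper)."""
    if len(freqs) == 1:
        return [1]  # a lone symbol still needs one bit
    tie = count()  # unique tie-breaker so heap entries never compare their lists
    heap = [(f, next(tie), [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), s1 + s2))
    return lengths


def cost(freqs, lengths):
    """Total encoded size in bits: sum_i f_i * l_i."""
    return sum(f * l for f, l in zip(freqs, lengths))


# Hypothetical skewed workload: one value accounts for 90% of the keys.
freqs = [90, 5, 3, 2]
huff = huffman_lengths(freqs)
fixed = [2, 2, 2, 2]  # a fixed-width 2-bit code, also uniquely decodable

print(huff, cost(freqs, huff), cost(freqs, fixed))
```

On this toy input the Huffman lengths are `[1, 2, 3, 3]`, for a total of 115 bits against 200 bits for the fixed-width code. The dominant value still costs one bit per key, since any prefix code over two or more values assigns each codeword at least one bit.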

Figures (4)

  • Figure 1: Results of sweeping $\alpha$ for each distribution across all four filter types. Each panel plots the lower bound (blue dashed), best empirical savings (red dashed), and the upper bound at the theory-guided and empirical-best parameters (light and dark blue solid). The vertical dashed line marks where the lower bound crosses zero.
  • Figure 2: Results of sweeping $\varepsilon$ for each distribution across all four filter types.
  • Figure 3: Comparison of the heuristic and AutoCSF cost models.
  • Figure 4: Memory--latency Pareto frontier across synthetic distributions and genomics datasets ($N = 100{,}000$ for synthetic, real sizes for genomics). Each trajectory sweeps $\alpha$ from 0.5 (right, more memory) to 0.99 (left, less memory). Log-log scale.

Theorems & Definitions (6)

  • Theorem 4.1 (Huffman, 1952)
  • Theorem 4.2
  • proof
  • Theorem 4.3
  • Theorem 4.4
  • proof