HierarchicalKV: A GPU Hash Table with Cache Semantics for Continuous Online Embedding Storage

Haidong Rong, Jiashu Yao, Matthias Langer, Shijie Liu, Li Fan, Dongxin Wang, Jia He, Jinglin Chen, Jiaheng Rang, Julian Qian, Mengyao Xu, Fan Yu, Minseok Lee, Zehuan Wang, Even Oldridge

Abstract

Traditional GPU hash tables preserve every inserted key -- a dictionary assumption that wastes scarce High Bandwidth Memory (HBM) when embedding tables routinely exceed single-GPU capacity. We challenge this assumption with cache semantics, where policy-driven eviction is a first-class operation. We introduce HierarchicalKV (HKV), the first general-purpose GPU hash table library whose normal full-capacity operating contract is cache-semantic: each full-bucket upsert (update-or-insert) is resolved in place by eviction or admission rejection rather than by rehashing or capacity-induced failure. HKV co-designs four core mechanisms -- cache-line-aligned buckets, in-line score-driven upsert, score-based dynamic dual-bucket selection, and triple-group concurrency -- and uses tiered key-value separation as a scaling enabler beyond HBM. On an NVIDIA H100 NVL GPU, HKV achieves up to 3.9 billion key-value pairs per second (B-KV/s) find throughput, stable across load factors 0.50-1.00 (<5% variation), and delivers 1.4x higher find throughput than WarpCore (the strongest dictionary-semantic GPU baseline at lambda=0.50) and up to 2.6-9.4x over indirection-based GPU baselines. Since its open-source release in October 2022, HKV has been integrated into multiple open-source recommendation frameworks.
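
The cache-semantic contract described in the abstract (a full-bucket upsert resolves in place by eviction or admission rejection, never by rehashing or failure) can be illustrated with a small CPU-side sketch. This is not HKV's API: the `Bucket` class, the return labels, and the admission rule (reject an incoming key whose score is below the bucket's current minimum) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """Hypothetical fixed-capacity bucket; maps key -> (score, value)."""
    capacity: int
    slots: dict = field(default_factory=dict)

    def upsert(self, key, value, score):
        # Case 1: key already present, update in place.
        if key in self.slots:
            self.slots[key] = (score, value)
            return "updated"
        # Case 2: free slot available, plain insert.
        if len(self.slots) < self.capacity:
            self.slots[key] = (score, value)
            return "inserted"
        # Case 3: bucket is full. Find the lowest-scored resident key.
        victim = min(self.slots, key=lambda k: self.slots[k][0])
        # Assumed admission rule: a newcomer scoring below the current
        # minimum is rejected instead of displacing a better entry.
        if score < self.slots[victim][0]:
            return "rejected"
        # Otherwise evict the victim and take its place.
        del self.slots[victim]
        self.slots[key] = (score, value)
        return "evicted"


bucket = Bucket(capacity=2)
results = [
    bucket.upsert("a", 1.0, score=5),
    bucket.upsert("b", 2.0, score=7),
    bucket.upsert("a", 1.5, score=6),  # existing key
    bucket.upsert("c", 3.0, score=3),  # full bucket, low score
    bucket.upsert("d", 4.0, score=9),  # full bucket, high score
]
```

Every call returns one of four outcomes and the bucket never grows past its capacity, which is the property that distinguishes this contract from a dictionary-semantic table.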

Paper Structure (33 sections, 4 theorems, 8 figures, 11 tables, 3 algorithms)

Key Result

Proposition 1

In single-bucket mode with $S{=}128$ slots per bucket, for any key $k$ not present in $T$, Find$(k)$ (Algorithm \ref{alg:find}) returns NotFound after examining exactly one bucket, performing $S$ digest comparisons and at most $S \cdot 2^{-8} = 0.5$ expected full-key comparisons, using a single 128-byte m...
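
The $0.5$ figure in Proposition 1 follows from treating each 8-bit digest comparison against an absent key as a false match with probability $2^{-8}$, so $S{=}128$ slots yield $128/256 = 0.5$ expected full-key comparisons. A minimal Python sketch of such a digest-filtered lookup is given below. The names `mix64`, `digest`, and `find_in_bucket` are hypothetical, the hash is the SplitMix64 finalizer used as a stand-in, and taking the top 8 hash bits as the digest is an assumption rather than HKV's documented choice.

```python
S = 128                      # slots per bucket, as in Proposition 1
MASK64 = (1 << 64) - 1


def mix64(x):
    """SplitMix64 finalizer, a stand-in for whatever hash HKV uses."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def digest(key):
    """Assumed 8-bit digest: the top byte of the 64-bit hash."""
    return mix64(key) >> 56


def find_in_bucket(bucket_keys, bucket_digests, key):
    """Return (slot or None, number of full-key comparisons performed)."""
    d = digest(key)
    full_compares = 0
    for slot, slot_digest in enumerate(bucket_digests):
        if slot_digest != d:
            continue             # cheap 1-byte rejection
        full_compares += 1       # digest matched, compare the full key
        if bucket_keys[slot] == key:
            return slot, full_compares
    return None, full_compares


# One full bucket of S keys, plus 2000 keys that are not in it.
bucket_keys = list(range(S))
bucket_digests = [digest(k) for k in bucket_keys]
absent_keys = range(1000, 3000)

avg_full_compares = sum(
    find_in_bucket(bucket_keys, bucket_digests, k)[1] for k in absent_keys
) / len(absent_keys)
```

Under these assumptions `avg_full_compares` should land near the predicted 0.5. On a GPU the inner loop would be replaced by a parallel comparison over the 128-byte digest array rather than a sequential scan.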

Figures (8)

  • Figure 1: Embedding lookup pipeline in a recommendation model. Sparse categorical features are mapped to dense vectors via embedding tables, which constitute the dominant memory consumer of the model. Online training continuously ingests new keys under a hard memory budget.
  • Figure 2: Workload characteristics of continuous online embedding ingestion. (a) Load factor increases monotonically as new features arrive; without eviction, it reaches the capacity ceiling. (b) Miss ratio remains high because new (unseen) feature IDs dominate during exploration phases. (c) In open-addressing schemes, probe distance grows super-linearly beyond load factor 0.8, causing warp divergence on GPUs.
  • Figure 3: HKV architecture. Keys, digests, and scores reside in HBM; overflow values are placed in pinned host memory (HMEM) via zero-copy mapped pointers. The SSD/GDS tier (dashed) is an architectural extension point.
  • Figure 4: Memory layout of a single HKV bucket (128 slots). The digest array occupies exactly one GPU L1 cache line (128 B), enabling a complete per-bucket negative lookup in a single cache-line load. Values are addressed by bucket and slot index (position-based addressing, §\ref{ssec:kvsep}); no per-entry pointer is stored.
  • Figure 5: Two-phase dual-bucket selection. Phase D1 (left) inserts into the less-loaded bucket for memory utilization; Phase D2 (right, $\lambda{\approx}1.0$) evicts in the bucket with the lower minimum score, improving eviction correctness.
  • ...and 3 more figures
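
The two-phase dual-bucket selection described in the Figure 5 caption can be sketched as a small decision function. This is an illustrative reading of the caption only: the function name `choose_bucket`, the dict-based bucket representation, and the tie-breaking toward the first bucket are assumptions, not HKV's implementation.

```python
def choose_bucket(bucket_a, bucket_b, capacity):
    """Pick a target among two candidate buckets.

    Each bucket is a dict mapping key -> score.
    Returns (action, index) where action is "insert" or "evict"
    and index is 0 for bucket_a, 1 for bucket_b.
    """
    load_a, load_b = len(bucket_a), len(bucket_b)

    # Phase D1: at least one bucket has room, so insert into the
    # less-loaded one to balance memory utilization.
    if load_a < capacity or load_b < capacity:
        return ("insert", 0 if load_a <= load_b else 1)

    # Phase D2: both buckets are full, so evict from the bucket
    # whose lowest-scored entry is lower overall.
    min_a = min(bucket_a.values())
    min_b = min(bucket_b.values())
    return ("evict", 0 if min_a <= min_b else 1)


phase_d1 = choose_bucket({"x": 1}, {}, capacity=2)
phase_d2 = choose_bucket({"x": 5, "y": 9}, {"z": 3, "w": 8}, capacity=2)
```

In the second call both buckets are full and the global minimum score (3) sits in the second bucket, so the eviction targets that bucket instead of displacing a higher-scored entry from the first.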

Theorems & Definitions (5)

  • Definition 1: Cache-Semantic Hash Table
  • Proposition 1: Definitive Per-Bucket Miss
  • Proposition 2: Liveness
  • Proposition 3: Score-Based Selection Advantage
  • Proposition 4: Reader Safety