HierarchicalKV: A GPU Hash Table with Cache Semantics for Continuous Online Embedding Storage

Haidong Rong; Jiashu Yao; Matthias Langer; Shijie Liu; Li Fan; Dongxin Wang; Jia He; Jinglin Chen; Jiaheng Rang; Julian Qian; Mengyao Xu; Fan Yu; Minseok Lee; Zehuan Wang; Even Oldridge


Abstract

Traditional GPU hash tables preserve every inserted key -- a dictionary assumption that wastes scarce High Bandwidth Memory (HBM) when embedding tables routinely exceed single-GPU capacity. We challenge this assumption with cache semantics, where policy-driven eviction is a first-class operation. We introduce HierarchicalKV (HKV), the first general-purpose GPU hash table library whose normal full-capacity operating contract is cache-semantic: each full-bucket upsert (update-or-insert) is resolved in place by eviction or admission rejection rather than by rehashing or capacity-induced failure. HKV co-designs four core mechanisms -- cache-line-aligned buckets, in-line score-driven upsert, score-based dynamic dual-bucket selection, and triple-group concurrency -- and uses tiered key-value separation as a scaling enabler beyond HBM. On an NVIDIA H100 NVL GPU, HKV achieves up to 3.9 billion key-value pairs per second (B-KV/s) find throughput, stable across load factors $\lambda$ from 0.50 to 1.00 (<5% variation). It delivers 1.4x higher find throughput than WarpCore (the strongest dictionary-semantic GPU baseline at $\lambda = 0.50$) and 2.6-9.4x higher find throughput than indirection-based GPU baselines. Since its open-source release in October 2022, HKV has been integrated into multiple open-source recommendation frameworks.
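The full-bucket contract described in the abstract (eviction or admission rejection, never rehashing or failure) can be illustrated with a minimal CPU-side sketch. This is not HKV's API or kernel code: the `Bucket` class, its return values, and the admission rule (reject when the incoming score is below the bucket's current minimum score) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """Hypothetical fixed-capacity bucket: key -> (value, score)."""
    capacity: int
    slots: dict = field(default_factory=dict)

    def upsert(self, key, value, score):
        """Update-or-insert that never grows the bucket.

        Returns (outcome, evicted_key), where outcome is one of
        'updated', 'inserted', 'evicted', 'rejected'.
        """
        if key in self.slots:                  # key already present: update in place
            self.slots[key] = (value, score)
            return ("updated", None)
        if len(self.slots) < self.capacity:    # free slot available: plain insert
            self.slots[key] = (value, score)
            return ("inserted", None)
        # Bucket is full: pick the entry with the minimum score as the victim.
        victim = min(self.slots, key=lambda k: self.slots[k][1])
        if score < self.slots[victim][1]:
            # Assumed admission rule: a newcomer scoring below every
            # resident entry is rejected instead of evicting anything.
            return ("rejected", None)
        del self.slots[victim]                 # evict in place, then insert
        self.slots[key] = (value, score)
        return ("evicted", victim)
```

In this sketch every call returns one of four outcomes and the bucket never exceeds `capacity`, which is the property that distinguishes cache semantics from dictionary semantics.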


Paper Structure (33 sections, 4 theorems, 8 figures, 11 tables, 3 algorithms)


Introduction
Background and Motivation
Embedding Storage in Recommendation Systems
GPU Hash Table Design Space
Workload Analysis: Continuous Online Ingestion
Challenges
Cache-Semantic Hash Tables
System Design
Architecture Overview
Single-Bucket-Confined Cache-Line-Aligned Design
Score-Driven Built-in Eviction
Score-Based Dynamic Dual-Bucket Selection
Triple-Group Concurrency Control
Tiered Key-Value Separation
Implementation
...and 18 more sections

Key Result

Proposition 1 (Definitive Per-Bucket Miss)

In single-bucket mode with $S = 128$ slots per bucket, for any key $k$ not present in $T$, Find$(k)$ (the Find algorithm) returns NotFound after examining exactly one bucket, performing $S$ digest comparisons and at most $S \cdot 2^{-8} = 0.5$ expected full-key comparisons, using a single 128-byte m
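The arithmetic behind the proposition can be sketched on the CPU as follows. With 8-bit digests, a key that is absent from the bucket matches any given slot's digest with probability $2^{-8}$, so $S = 128$ slots yield $128 / 256 = 0.5$ expected full-key comparisons. The code below is an illustrative model only, not the paper's Find algorithm: the use of `hashlib.blake2b` as the hash, the choice of which hash byte serves as the digest, and the helper names are all assumptions.

```python
import hashlib

S = 128            # slots per bucket (from the proposition)
DIGEST_BITS = 8    # one byte per slot, so 128 digests fill 128 bytes

# Expected full-key comparisons for an absent key, assuming uniform digests.
EXPECTED_FULL_KEY_COMPARES = S * 2 ** -DIGEST_BITS


def hash64(key: int) -> int:
    """Stand-in 64-bit hash (assumption; the real hash function is not modeled)."""
    raw = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(raw, "little")


def digest8(key: int) -> int:
    """8-bit digest taken from the hash (which bits are used is an assumption)."""
    return (hash64(key) >> 32) & 0xFF


def find_in_bucket(digests, keys, values, key):
    """Scan one bucket: S digest comparisons, full-key compare only on digest match.

    Returns (value or None, number_of_full_key_comparisons).
    """
    d = digest8(key)
    full_compares = 0
    for i in range(S):
        if digests[i] == d:            # cheap 1-byte comparison
            full_compares += 1
            if keys[i] == key:         # expensive full-key comparison
                return values[i], full_compares
    return None, full_compares         # definitive miss for this bucket


# Build one full bucket holding keys 0..127.
bucket_keys = list(range(S))
bucket_values = [k * 10 for k in bucket_keys]
bucket_digests = [digest8(k) for k in bucket_keys]
```

A lookup such as `find_in_bucket(bucket_digests, bucket_keys, bucket_values, 10**9)` returns `None` after touching only the digest array plus however many slots happen to share the query's digest, which is the source of the 0.5 expected value.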

Figures (8)

Figure 1: Embedding lookup pipeline in a recommendation model. Sparse categorical features are mapped to dense vectors via embedding tables, which constitute the dominant memory consumer of the model. Online training continuously ingests new keys under a hard memory budget.
Figure 2: Workload characteristics of continuous online embedding ingestion. (a) Load factor increases monotonically as new features arrive; without eviction, it reaches the capacity ceiling. (b) Miss ratio remains high because new (unseen) feature IDs dominate during exploration phases. (c) In open-addressing schemes, probe distance grows super-linearly beyond load factor 0.8, causing warp divergence on GPUs.
Figure 3: HKV architecture. Keys, digests, and scores reside in HBM; overflow values are placed in pinned host memory (HMEM) via zero-copy mapped pointers. The SSD/GDS tier (dashed) is an architectural extension point.
Figure 4: Memory layout of a single HKV bucket (128 slots). The digest array occupies exactly one GPU L1 cache line (128 B), enabling a complete per-bucket negative lookup in a single cache-line load. Values are addressed by bucket and slot index (position-based addressing; see Tiered Key-Value Separation); no per-entry pointer is stored.
Figure 5: Two-phase dual-bucket selection. Phase D1 (left) inserts into the less-loaded bucket for memory utilization; Phase D2 (right, $\lambda \approx 1.0$) evicts in the bucket with the lower minimum score, improving eviction correctness.
...and 3 more figures
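The two-phase rule in the Figure 5 caption can be sketched as a small decision function. This is a hedged illustration rather than the paper's algorithm: the function name `choose_bucket`, the representation of a bucket as a list of slot scores, the tie-breaking toward the first bucket, and the assumption that Phase D2 applies exactly when both candidate buckets are full are not taken from the source.

```python
def choose_bucket(scores_a, scores_b, capacity):
    """Pick one of two candidate buckets for an incoming key.

    scores_a, scores_b: scores of the occupied slots in each candidate bucket.
    Returns (bucket_index, action) with action 'insert' or 'evict'.
    """
    if len(scores_a) < capacity or len(scores_b) < capacity:
        # Phase D1: at least one bucket has room, so insert into the
        # less-loaded one to balance memory utilization.
        return (0 if len(scores_a) <= len(scores_b) else 1), "insert"
    # Phase D2: both buckets are full, so evict in the bucket whose
    # minimum score is lower (the globally weaker victim of the two).
    return (0 if min(scores_a) <= min(scores_b) else 1), "evict"
```

For example, with `capacity=4`, the buckets `[5, 9, 1, 3]` and `[7, 8, 2, 6]` are both full, and the function selects the first bucket for eviction because its minimum score (1) is lower than the second bucket's (2).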

Theorems & Definitions (5)

Definition 1: Cache-Semantic Hash Table
Proposition 1: Definitive Per-Bucket Miss
Proposition 2: Liveness
Proposition 3: Score-Based Selection Advantage
Proposition 4: Reader Safety
