CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference
Dong Liu, Yanxuan Yu, Ben Lengerich
TL;DR
CSV-Decode addresses the output-layer bottleneck in large language models by constructing certifiable sub-vocabularies through offline vocabulary clustering and centroid-plus-radius geometric bounds. It provides two provable guarantees, exact top-$k$ certification and $\varepsilon$-certified softmax, via an online adaptive algorithm that expands the sub-vocabulary only as needed. The system is fully implemented with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. It achieves 2.7–5× speedups while maintaining high quality and low fallback rates across diverse models and tasks, and it significantly reduces energy consumption. The work demonstrates that geometric pruning is practical for efficient, reliable inference and outlines directions for adaptive clustering and integration with other efficiency techniques.
Abstract
Large language models face significant computational bottlenecks during inference due to the expensive output-layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct a small sub-vocabulary for each decoding step, enabling efficient sparse computation while maintaining two correctness guarantees: exact top-$k$ certification and $\varepsilon$-certified softmax approximation. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full-vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code is available at \href{https://github.com/FastLM/CSV-Decode}{https://github.com/FastLM/CSV-Decode}.
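
To make the centroid-plus-radius idea concrete, the following is a minimal sketch of exact top-$k$ certification, not the authors' implementation. It assumes logits are plain inner products $\langle w_i, h \rangle$ with no bias term and that a token-to-cluster assignment is already given. By the Cauchy-Schwarz inequality, every token $i$ in a cluster with centroid $\mu_c$ and radius $r_c$ satisfies $\langle w_i, h \rangle \le \langle \mu_c, h \rangle + r_c \lVert h \rVert$. The helper names (`build_clusters`, `certified_top_k`) and the toy data are hypothetical and chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def build_clusters(W, assignment):
    """Offline step: centroid and radius per cluster, given a token-to-cluster assignment."""
    members = {}
    for tok, c in enumerate(assignment):
        members.setdefault(c, []).append(tok)
    clusters = []
    for _, toks in sorted(members.items()):
        dim = len(W[toks[0]])
        centroid = [sum(W[t][j] for t in toks) / len(toks) for j in range(dim)]
        # Radius: largest distance from the centroid to any member embedding.
        radius = max(norm([W[t][j] - centroid[j] for j in range(dim)]) for t in toks)
        clusters.append((centroid, radius, toks))
    return clusters

def certified_top_k(W, clusters, h, k):
    """Online step: open clusters by descending upper bound until the top-k is certified.

    Returns (top-k token ids, number of tokens scored exactly).
    """
    h_norm = norm(h)
    # Upper bound on every logit inside a cluster: <mu_c, h> + r_c * ||h||.
    bounds = sorted(
        ((dot(mu, h) + r * h_norm, toks) for mu, r, toks in clusters),
        key=lambda b: b[0],
        reverse=True,
    )
    exact = []  # (exact logit, token id) for every token scored so far
    for idx, (_, toks) in enumerate(bounds):
        exact.extend((dot(W[t], h), t) for t in toks)
        exact.sort(reverse=True)
        # Largest bound among clusters that are still unopened.
        remaining = bounds[idx + 1][0] if idx + 1 < len(bounds) else -math.inf
        # Certified (up to ties) once the k-th exact logit is at least every remaining bound.
        if len(exact) >= k and exact[k - 1][0] >= remaining:
            break
    return [t for _, t in exact[:k]], len(exact)

# Toy vocabulary of 6 tokens in 2 dimensions, grouped into 3 clusters (assumed data).
W = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, -0.1]]
assignment = [0, 0, 1, 1, 2, 2]
h = [2.0, 0.1]

clusters = build_clusters(W, assignment)
top, scored = certified_top_k(W, clusters, h, k=2)

# Reference answer from scoring the full vocabulary.
full = sorted(range(len(W)), key=lambda t: dot(W[t], h), reverse=True)[:2]
print(top, full, scored)
```

On this toy input the first cluster alone certifies the top-2 result, so only 2 of the 6 tokens are scored exactly. A production version would presumably add a small numerical slack to the bound to absorb floating-point rounding and would fall back to full-vocabulary scoring when too many clusters have to be opened.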
