CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference

Dong Liu, Yanxuan Yu, Ben Lengerich

TL;DR

CSV-Decode addresses the output-layer bottleneck in large language models by constructing certifiable sub-vocabularies through offline vocabulary clustering and centroid-plus-radius geometric bounds. It provides two provable guarantees, exact top-$k$ certification and $\varepsilon$-certified softmax, via an online adaptive algorithm that expands the sub-vocabulary only as needed. The system is fully implemented with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. It achieves 2.7–5× speedups and significant energy savings while maintaining high output quality and low fallback rates across diverse models and tasks. The work demonstrates the practicality of geometric pruning for efficient, reliable inference and outlines future directions in adaptive clustering and integration with other efficiency techniques.

Abstract

Large language models face significant computational bottlenecks during inference due to the expensive output layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-$k$ certification and $\varepsilon$-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code is available at https://github.com/FastLM/CSV-Decode.
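
To make the centroid-plus-radius idea concrete, here is a minimal pure-Python sketch of exact top-$k$ certification. It is an illustration under stated assumptions, not the authors' implementation. It assumes the bound follows from Cauchy-Schwarz: for a hidden state $h$ and a cluster with centroid $\mu_c$ and radius $r_c = \max_{i \in c} \lVert w_i - \mu_c \rVert$, every token in that cluster satisfies $\langle h, w_i \rangle \le \langle h, \mu_c \rangle + \lVert h \rVert \, r_c$. The function names (`build_clusters`, `certified_top_k`), the absence of a bias term, and the cluster assignment being supplied by the caller are all hypothetical choices made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def build_clusters(W, assignment, n_clusters):
    """Offline step (sketch): centroid and radius for each non-empty cluster.

    W is a list of output-embedding rows; assignment[i] is the cluster of
    token i and is assumed to come from any clustering method.
    """
    dim = len(W[0])
    clusters = []
    for c in range(n_clusters):
        ids = [i for i, a in enumerate(assignment) if a == c]
        mu = [sum(W[i][d] for i in ids) / len(ids) for d in range(dim)]
        radius = max(norm([W[i][d] - mu[d] for d in range(dim)]) for i in ids)
        clusters.append({"ids": ids, "mu": mu, "radius": radius})
    return clusters

def certified_top_k(h, W, clusters, k):
    """Online step (sketch): open clusters until the top-k is certified.

    Returns the top-k token ids and the number of logits actually computed.
    """
    h_norm = norm(h)
    # Upper bound on any logit inside each cluster, sorted best-first.
    bounds = sorted(
        ((dot(h, c["mu"]) + h_norm * c["radius"], idx)
         for idx, c in enumerate(clusters)),
        reverse=True,
    )
    logits = {}
    top = []
    for pos, (_, idx) in enumerate(bounds):
        # Expand the sub-vocabulary by one cluster and compute exact logits.
        for i in clusters[idx]["ids"]:
            logits[i] = dot(h, W[i])
        top = sorted(logits, key=logits.get, reverse=True)[:k]
        remaining = bounds[pos + 1:]
        best_unopened = remaining[0][0] if remaining else -math.inf
        # Certified (up to ties): no unopened token can beat the k-th logit.
        if len(top) == k and logits[top[-1]] >= best_unopened:
            break
    return top, len(logits)
```

Calling `certified_top_k(h, W, clusters, k)` returns the same token ids as sorting all logits whenever certification succeeds, while computing exact logits only for the clusters that had to be opened. If every cluster ends up opened, the loop degenerates to full-vocabulary decoding, which plays the role of a fallback in this sketch.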

Paper Structure

This paper contains 38 sections, 30 equations, 4 figures, 9 tables, 2 algorithms.

Figures (4)

  • Figure 1: System Configuration Analysis. (a) illustrates the complex interaction between vocabulary size and clustering parameters on overall speedup, (b) shows multi-dimensional trade-offs across different configuration choices.
  • Figure 2: Performance Analysis Across Models and Metrics. (a) shows consistent speedup gains with low variance, (b) demonstrates superior latency distribution, (c) validates significant energy savings.
  • Figure 3: Vocabulary size impact on speedup and sub-vocabulary size vs fallback rate.
  • Figure 4: Certification and Robustness Analysis. (a) demonstrates adaptive behavior across domains, (b) shows bound quality improvement with context, (c) provides comprehensive multi-metric comparison.