HAMburger: Accelerating LLM Inference via Token Smashing

Jingyu Liu, Ce Zhang

TL;DR

HAMburger tackles the fundamental bottleneck in LLM decoding: linear growth of KV-cache size and forward FLOPs with output length. It introduces a Hierarchically Auto-regressive Model that fuses multiple tokens into a single KV entry via a Compositional Embedder and generates several tokens per macro-step with a Micro-Step Decoder, effectively shifting resource usage from linear to sub-linear with output length. The method is complemented by dynamic data segmentation, implicit regularization, and a self-speculative decoding paradigm that avoids external drafting and verification costs. Empirical results on standard and long-context tasks show up to 2x reductions in KV-cache computation and up to 2x improvements in TPS, while preserving or improving accuracy, suggesting substantial practical impact for fast, memory-efficient LLM inference across hardware settings.

Abstract

The growing demand for efficient Large Language Model (LLM) inference requires holistic optimization across algorithms, systems, and hardware. However, very few works have fundamentally changed the generation pattern: each token needs one forward pass and one KV cache entry. This can be sub-optimal because we find that LLMs are extremely capable of self-identifying the exact dose of information that a single KV cache entry can store, and many tokens can be generated confidently without global context. Based on this insight, we introduce HAMburger, a Hierarchically Auto-regressive Model that redefines resource allocation in LLMs by moving beyond uniform computation and storage per token during inference. By stacking a compositional embedder and a micro-step decoder around a base LLM, HAMburger smashes multiple tokens into a single KV entry and generates several tokens per step. Additionally, HAMburger functions as a speculative decoding framework in which it can blindly trust self-drafted tokens. As a result, HAMburger shifts the growth of KV cache and forward FLOPs from linear to sub-linear with respect to output length, and adjusts its inference speed based on query perplexity and output structure. Extensive evaluations show that HAMburger reduces the KV cache computation by up to 2$\times$ and achieves up to 2$\times$ TPS, while maintaining quality in both short- and long-context tasks. Our method explores an extremely challenging inference regime that requires both computation and memory efficiency with a hardware-agnostic design.
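
To make the resource argument concrete, the following is a minimal sketch of how the number of decode-time KV entries (and forward passes) changes when several output tokens share one entry. The function name `decode_kv_entries` and the parameter `tokens_per_step` are hypothetical, and the sketch assumes a fixed average number of tokens per macro-step purely for illustration; in the method described above that number varies with the content being generated.

```python
import math


def decode_kv_entries(output_len: int, tokens_per_step: float = 1.0) -> int:
    """Approximate KV entries written while decoding `output_len` tokens.

    Assumption (not from the paper): every macro-step writes exactly one
    KV entry and emits `tokens_per_step` tokens on average. A value of 1.0
    corresponds to standard one-token-per-step decoding.
    """
    if output_len < 0:
        raise ValueError("output_len must be non-negative")
    if tokens_per_step < 1.0:
        raise ValueError("tokens_per_step must be at least 1.0")
    return math.ceil(output_len / tokens_per_step)


baseline = decode_kv_entries(1024, tokens_per_step=1.0)
fused = decode_kv_entries(1024, tokens_per_step=2.0)
print(baseline, fused)  # 1024 512
```

Under this simplification, an average of two tokens per macro-step halves both the KV entries and the forward passes needed for the same output, which matches the scale of the "up to 2$\times$" figures quoted above.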

Paper Structure

This paper contains 27 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: How HAMburger Merges Tokens: We showcase two examples of dynamically generating a "unit dose of information" per step instead of a fixed single token (left). The output tokens from the static vocabulary are distinguished by alternating blues and greens. The red dividers separate groups of tokens that HAMburger predicts in a single macro-step. During data pre-processing, we rely on the model's own knowledge (i.e., conditional entropy) for segmentation (right).
  • Figure 2: HAMburger Overview: HAMburger stacks the base model with two additional modules that can fuse and predict multiple tokens per iteration for faster decoding.
  • Figure 3: Standard Task Evaluation: We present the standard task evaluations for our method, which cover instruction following, math, reasoning, and code. The bottom x-axis denotes the KV cache compression rate, which relates to the decoding TPS speedup. The top x-axis shows a tunable parameter that trades off efficiency against quality. The green shaded area marks the range of tolerable quality loss. In almost all tasks, HAMburger achieves large efficiency gains with minimal quality loss.
  • Figure 4: Long Context Task Evaluation: We showcase the long-context performance of HAMburger on the LongBench suite. With the same graph settings as Figure 3, our method performs competitively in quality while significantly reducing serving cost and latency.
  • Figure 5: Efficiency Benchmarking: We compare decoding tokens per second (TPS) for HAMburger against the baselines. For 1B models, HAMburger achieves up to 2.2$\times$ decoding TPS speedup over the base model. HAMburger also beats speculative sampling with a 1B INT8 draft model, even when that baseline has a $>$90% acceptance rate, in both speedup (up to 2.73$\times$) and memory savings.
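
The Figure 1 caption says that segmentation during data pre-processing relies on conditional entropy. The following is a minimal sketch of one way such a rule could look, not the authors' procedure: a token starts a new group when its conditional entropy exceeds a threshold, and otherwise joins the current group. The names `token_entropy` and `segment_by_entropy`, the `threshold` and `max_group` parameters, and the example entropy values are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def segment_by_entropy(
    tokens: Sequence[str],
    entropies: Sequence[float],
    threshold: float = 1.0,  # hypothetical cut-off, not from the paper
    max_group: int = 4,      # hypothetical cap on tokens per group
) -> List[List[str]]:
    """Group consecutive tokens so that each group begins at a high-entropy token.

    Low-entropy tokens are treated as predictable from local context and are
    merged into the group opened by the preceding high-entropy token.
    """
    if len(tokens) != len(entropies):
        raise ValueError("tokens and entropies must have the same length")
    groups: List[List[str]] = []
    for tok, h in zip(tokens, entropies):
        start_new = not groups or h > threshold or len(groups[-1]) >= max_group
        if start_new:
            groups.append([tok])
        else:
            groups[-1].append(tok)
    return groups


# Made-up entropies for illustration only.
tokens = ["The", " capital", " of", " France", " is", " Paris"]
entropies = [3.0, 2.5, 0.2, 1.8, 0.3, 0.1]
print(segment_by_entropy(tokens, entropies))
# [['The'], [' capital', ' of'], [' France', ' is', ' Paris']]
```

In this toy run, six tokens collapse into three groups, each of which would correspond to one macro-step and one KV entry under the scheme the captions describe.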