Foundations of Top-$k$ Decoding For Language Models

Georgy Noarov, Soham Mallick, Tao Wang, Sunay Joshi, Yan Sun, Yangxinyu Xie, Mengxin Yu, Edgar Dobriban

TL;DR

This work provides a theoretical foundation for top-$k$ decoding in language models by modeling decoding as recovering a sparse distribution through sparsity-regularized, separable Bregman divergences. It introduces primal and dual Bregman decoding frameworks with $\ell_0$ regularization, proving that optimal supports are greedy top-$k$ sets and that the cost in $k$ is discretely convex, enabling efficient adaptive-$k$ search. The authors develop renormalization maps for fixed sparsity patterns and establish conditions under which both primal and dual decoding are tractable, including an alpha-entropy family of decoders that generalize top-$k$ (recovered at $\alpha=1$) and exhibit varied mass-shifting behavior. Experiments on open-ended generation and math reasoning show competitive performance with standard top-$k$ decoding and demonstrate the practical viability of adaptive sparsity and alpha-based strategies for decoding in large language models.

Abstract

Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the largest $k$ next-token-probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We consider \emph{Bregman decoders} obtained by minimizing a separable Bregman divergence (for both the \emph{primal} and \emph{dual} cases) with a sparsity-inducing $\ell_0$ regularization. Despite the combinatorial nature of the objective, we show how to optimize it efficiently for a large class of divergences. We show that the optimal decoding strategies are greedy, and further that the loss function is discretely convex in $k$, so that binary search provably and efficiently finds the optimal $k$. We show that top-$k$ decoding arises as a special case for the KL divergence, and identify new decoding strategies that have distinct behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
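The truncate-and-renormalize step and the binary search over $k$ described above can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the function names (`top_k_renormalize`, `kl_cost`, `adaptive_k`, `sample_top_k`) are hypothetical, and the cost is assumed to be $\mathrm{KL}(\hat p \,\|\, p) + \lambda k$, which for the renormalized top-$k$ vector $\hat p$ reduces to $-\log(\text{top-}k\text{ mass}) + \lambda k$. The paper's exact objective and normalization may differ.

```python
import math
import random


def top_k_renormalize(probs, k):
    """Keep the k largest probabilities and rescale them to sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    mass = sum(probs[i] for i in keep)
    out = [0.0] * len(probs)
    for i in keep:
        out[i] = probs[i] / mass
    return out


def kl_cost(probs, k, lam):
    """Assumed cost: KL(p_hat || p) + lam * k = -log(top-k mass) + lam * k."""
    mass = sum(sorted(probs, reverse=True)[:k])
    return -math.log(mass) + lam * k


def adaptive_k(probs, lam):
    """Binary search for the smallest minimizer of a discretely convex cost in k."""
    lo, hi = 1, len(probs)
    while lo < hi:
        mid = (lo + hi) // 2
        if kl_cost(probs, mid + 1, lam) < kl_cost(probs, mid, lam):
            lo = mid + 1  # cost still decreasing, so the minimizer lies right of mid
        else:
            hi = mid  # cost no longer decreasing, so the minimizer is at most mid
    return lo


def sample_top_k(probs, k, rng=random):
    """Draw one token index from the renormalized top-k distribution."""
    weights = top_k_renormalize(probs, k)
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.1, 0.05, 0.05]  # toy next-token distribution
k_star = adaptive_k(probs, lam=0.1)
token = sample_top_k(probs, k_star)
```

The search relies only on discrete convexity: once the cost stops decreasing from $k$ to $k+1$, it cannot decrease again, so comparing two adjacent values is enough to halve the range at each step. For clarity the sketch recomputes the sorted prefix mass on every call; sorting once and keeping cumulative sums would avoid the repeated work.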

Paper Structure

This paper contains 48 sections, 14 theorems, 100 equations, 9 figures, 3 tables, 4 algorithms.

Key Result

Theorem 3.2

The primal Bregman decoding strategy from \eqref{sp-br} is greedy for any primal valid potential $\phi$.

Figures (9)

  • Figure 1: Illustration of the landscape of the sparse Bregman objective for the primal (left) and dual (right) cases. We choose a $V=3$ dimensional example where the target vector is $p = (0.1, 0.01, 0.001)/0.111$. We show an $\alpha$-Bregman divergence (see Section \ref{ex}) with $\alpha=10$ and $\lambda=0.01$.
  • Figure 2: Comparison of primal (left) and dual (right) Bregman $\alpha$-renormalization maps (see Section \ref{ex}) on input vector $x = \frac{0.67}{\sum_{i=1}^k \frac{i}{k}}\left[1, \frac{k-1}{k}, \ldots, \frac{1}{k} \right] \in \Delta_{\mathrm{sub},k}$ with $k=100$. We plot the renormalized values against the original coordinate values of $x$.
  • Figure 3: Perplexity and repetition frequency differences between generated and human-written text for GPT2-large (left two panels) and LLaMA 3.1 8B (right two panels), for various $k$ values. We show top-$k$ decoding and primal decoding with $\alpha \in \{1.5, 2.0\}$. Standard deviations are estimated using 1000 bootstrap resamples.
  • Figure 4: Comparison of primal and dual renormalization maps: The transformation of the larger value ($0.1$, left) and of the smaller value ($0.001$, right).
  • Figure 5: Nonconvexity of the Bregman dual landscape on the square $(x, y) \in [0, 1]^2$.
  • ...and 4 more figures

Theorems & Definitions (27)

  • Definition 2.1: Renormalization
  • Definition 2.2: Generalized top-$k$ decoding
  • Definition 3.1: Greedy decoding
  • Theorem 3.2: Primal Bregman decoding is greedy
  • Theorem 3.3: Dual Bregman decoding is greedy
  • Theorem 3.4: Discrete primal and dual cost convexity
  • Definition 4.1: Primal Bregman $\alpha$-decoding
  • Proposition 4.2: Special primal $\alpha$-renormalization maps
  • Lemma 4.2
  • Theorem A.1: Uniqueness and formula for dual Bregman renormalization
  • ...and 17 more