Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Xiaotong Ji; Rasul Tutunov; Matthieu Zimmer; Haitham Bou-Ammar

TL;DR

This work reframes language-model decoding as a principled optimisation on the probability simplex, where the next-token distribution $q_t$ is chosen to maximise $\langle q_t, s_t\rangle$ minus a regulariser $\lambda\Omega(q_t)$ subject to simplex constraints. By deriving the KKT conditions, the authors show that classic decoders (greedy, softmax, Top-K, Top-P, Sparsemax) arise as special cases, thereby unifying decoding methods as regulariser-driven optima with a common master objective. They introduce mirror ascent as a practical solver for general regularisers on the simplex and present BoK, a KL-anchored, coverage-based regulariser for multi-sample decoding, which improves accuracy notably at high sampling temperatures with modest computational overhead. Empirical results on 7B Qwen variants across MATH500, GPQA, and HumanEval demonstrate BoK's robust gains and practicality, illustrating the framework’s potential to design new decoders beyond folklore. The work positions decoding as a design problem in objective space, enabling principled, plug-in improvements for multi-sample pipelines and downstream verification tasks.
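As a minimal sketch of the master objective with one concrete regulariser, the Python below assumes $\Omega$ is negative entropy, $\Omega(q) = \sum_i q_i \log q_i$. In that case the maximiser over the simplex has the well-known closed form $q = \mathrm{softmax}(s/\lambda)$. The code checks this against a mirror-ascent loop that uses the entropic mirror map (exponentiated-gradient updates). The function names, step size, and scores are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np

def softmax(s, lam=1.0):
    """Closed-form maximiser of <q, s> - lam * sum(q * log q) over the simplex."""
    z = s / lam
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mirror_ascent_entropy(s, lam=1.0, steps=200, eta=None):
    """Entropic mirror ascent on f(q) = <q, s> - lam * sum(q * log q).

    Assumption: step size eta = 0.5 / lam, chosen here for illustration.
    """
    if eta is None:
        eta = 0.5 / lam
    q = np.full(s.shape, 1.0 / s.size)          # start from the uniform distribution
    for _ in range(steps):
        grad = s - lam * (np.log(q) + 1.0)      # gradient of f at q
        log_q = np.log(q) + eta * grad          # multiplicative update in log space
        log_q -= log_q.max()
        q = np.exp(log_q)
        q /= q.sum()                            # renormalise back onto the simplex
    return q

scores = np.array([2.0, 1.0, 0.1, -1.0])        # hypothetical next-token scores s_t
q_iter = mirror_ascent_entropy(scores, lam=0.7)
q_closed = softmax(scores, lam=0.7)
print(np.max(np.abs(q_iter - q_closed)))        # gap between iterate and closed form
```

Here $\lambda$ plays the role of the sampling temperature: larger values flatten $q$, and smaller values push it towards the greedy one-hot solution.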

Abstract

Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue decoding should be understood as a principled optimisation layer: at each token, we solve a regularised problem over the probability simplex that trades off model score against structural preferences and constraints. This single template recovers greedy decoding, Softmax sampling, Top-K, Top-P, and Sparsemax-style sparsity as special cases, and explains their common structure through optimality conditions. More importantly, the framework makes it easy to invent new decoders without folklore. We demonstrate this by designing Best-of-K (BoK), a KL-anchored coverage objective aimed at multi-sample pipelines (self-consistency, reranking, verifier selection). BoK targets the probability of covering good alternatives within a fixed K-sample budget and improves empirical performance. For example, BoK sampling improves accuracy by +18.6% for Qwen2.5-Math-7B on MATH500 at high sampling temperatures.
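To make the idea of coverage within a K-sample budget concrete, the sketch below computes a standard quantity: the probability that at least one of K independent draws from a distribution lands in a set carrying total mass $m$, which is $1 - (1 - m)^K$. This is only an illustration of what coverage means. It is not the BoK objective itself, whose exact form is not given in this summary, and the function name and the numbers are assumptions.

```python
def coverage_probability(mass_on_good, k):
    """Probability that at least one of k i.i.d. samples falls in a set
    whose total probability mass is mass_on_good."""
    if not 0.0 <= mass_on_good <= 1.0:
        raise ValueError("mass_on_good must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - mass_on_good) ** k

# Hypothetical example: 10% of the mass sits on good alternatives.
for k in (1, 4, 8, 16):
    print(k, round(coverage_probability(0.10, k), 4))
```

With 10% of the mass on good alternatives, eight samples already cover at least one of them with probability of roughly 0.57, which is why reshaping the distribution matters more as the sample budget grows.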

Paper Structure (28 sections, 82 equations, 1 figure, 3 tables, 1 algorithm)


Introduction
Decoding, Sampling and Optimisation
Decoding as a Decision Over Distributions
Decoding as Distributional Optimisation
Interpreting the Master Equation.
Deriving Solution Conditions for the Master Equation.
Getting the Inequality Back.
LLM Decoding Strategies are Different Regularisers
Greedy Decoding: The Boring but Necessary Case
From Negative Entropy to Softmax: A Predictable Ending
Trimming the Vocabulary: Top-K Samplers
Mind the Mass: Top-P Sampling
Defining the nucleus.
Letting Probabilities Go to Zero: Sparsemax Decoding
Going Beyond Current Decoders
...and 13 more sections

Figures (1)

Figure 1: Framework of Decoding as Optimisation: The master objective generalises standard LLM decoding strategies. By choosing appropriate $\lambda, \Omega(q)$ and $\mathcal{C}_t$, we can recover current decoding strategies as special cases.
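One way to read the caption is that Top-K and Top-P restrict which tokens may carry mass, after which the remaining scores are normalised as in softmax. The sketch below implements the standard Top-K and Top-P filters in that form. How the paper encodes these restrictions (through $\Omega(q)$, through $\mathcal{C}_t$, or both) is not spelled out in this summary, so the mask-based formulation and all function names are assumptions.

```python
import numpy as np

def restricted_softmax(s, mask, lam=1.0):
    """Softmax of s / lam over the tokens where mask is True, zero elsewhere."""
    z = np.where(mask, s / lam, -np.inf)
    z = z - z[mask].max()                 # shift for numerical stability
    e = np.exp(z)                         # exp(-inf) = 0 for excluded tokens
    return e / e.sum()

def top_k_mask(s, k):
    """Boolean mask selecting the k highest-scoring tokens."""
    idx = np.argsort(s)[::-1][:k]
    mask = np.zeros(s.shape, dtype=bool)
    mask[idx] = True
    return mask

def top_p_mask(p, top_p):
    """Boolean mask for the smallest set of tokens whose mass reaches top_p."""
    order = np.argsort(p)[::-1]           # tokens sorted by decreasing probability
    csum = np.cumsum(p[order])
    cutoff = int(np.searchsorted(csum, top_p)) + 1   # include the crossing token
    mask = np.zeros(p.shape, dtype=bool)
    mask[order[:cutoff]] = True
    return mask

scores = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical next-token scores
full = restricted_softmax(scores, np.ones(scores.shape, dtype=bool))
q_top_k = restricted_softmax(scores, top_k_mask(scores, k=2))
q_top_p = restricted_softmax(scores, top_p_mask(full, top_p=0.8))
print(q_top_k, q_top_p)
```

In this example both filters keep the same two tokens, so the two distributions coincide. In general Top-K fixes the support size while Top-P lets it vary with how peaked the distribution is.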
