Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou-Ammar

TL;DR

This work reframes language-model decoding as an optimisation problem on the probability simplex: at each step, the next-token distribution $q_t$ is chosen to maximise $\langle q_t, s_t\rangle - \lambda\Omega(q_t)$, where $s_t$ is the vector of model scores, $\Omega$ is a regulariser, and $q_t$ is constrained to the simplex. From the KKT conditions of this master objective, the authors show that the classic decoders (greedy, softmax, Top-K, Top-P, Sparsemax) arise as special cases, each corresponding to a particular choice of regulariser and constraints. They introduce mirror ascent as a practical solver for general regularisers on the simplex and present Best-of-K (BoK), a KL-anchored, coverage-based regulariser for multi-sample decoding. BoK improves accuracy most at high sampling temperatures (for example, +18.6% for Qwen2.5-Math-7B on MATH500) with modest computational overhead. Empirical results on 7B Qwen variants across MATH500, GPQA, and HumanEval show consistent gains, illustrating how the framework can be used to design new decoders rather than relying on folklore. The work positions decoding as a design problem in objective space, enabling principled, plug-in improvements for multi-sample pipelines and downstream verification tasks.

Abstract

Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue decoding should be understood as a principled optimisation layer: at each token, we solve a regularised problem over the probability simplex that trades off model score against structural preferences and constraints. This single template recovers greedy decoding, Softmax sampling, Top-K, Top-P, and Sparsemax-style sparsity as special cases, and explains their common structure through optimality conditions. More importantly, the framework makes it easy to invent new decoders without folklore. We demonstrate this by designing Best-of-K (BoK), a KL-anchored coverage objective aimed at multi-sample pipelines (self-consistency, reranking, verifier selection). BoK targets the probability of covering good alternatives within a fixed K-sample budget and improves empirical performance. We show that such samples can improve accuracy by, for example, +18.6% for Qwen2.5-Math-7B on MATH500 at high sampling temperatures.
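
As an illustration of the regularised template described above, the following is a minimal sketch, not the authors' implementation. It assumes the regulariser is the negative Shannon entropy, $\Omega(q) = \sum_i q_i \log q_i$, for which the maximiser of $\langle q, s\rangle - \lambda\Omega(q)$ on the simplex is known in closed form to be softmax with temperature $\lambda$. The solver is the standard entropic mirror-ascent (exponentiated-gradient) update; the function names, the step size `eta`, and the example scores are hypothetical choices made for this sketch, and the paper's own algorithm may differ in its details.

```python
import math


def softmax(scores, temperature=1.0):
    """Closed-form maximiser of <q, s> - temperature * sum(q log q) on the simplex."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def mirror_ascent_entropy(scores, lam, eta=0.5, steps=200):
    """Entropic mirror ascent on <q, s> - lam * sum(q log q).

    Assumption for this sketch: Omega(q) = sum_i q_i log q_i (negative entropy).
    Each step is a multiplicative update followed by renormalisation, so the
    iterate stays strictly inside the simplex.
    """
    n = len(scores)
    q = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(steps):
        # Gradient of the objective with respect to q_i: s_i - lam * (log q_i + 1).
        grad = [s - lam * (math.log(qi) + 1.0) for s, qi in zip(scores, q)]
        # Update in log space, then renormalise (the mirror step).
        log_q = [math.log(qi) + eta * g for qi, g in zip(q, grad)]
        m = max(log_q)
        w = [math.exp(v - m) for v in log_q]
        z = sum(w)
        q = [wi / z for wi in w]
    return q


# Hypothetical scores for a four-token vocabulary.
scores = [2.0, 1.0, 0.1, -1.0]
lam = 0.7

q_mirror = mirror_ascent_entropy(scores, lam)
q_closed = softmax(scores, temperature=lam)

print("mirror ascent:", [round(p, 6) for p in q_mirror])
print("softmax(s/lam):", [round(p, 6) for p in q_closed])
```

In this entropic case the update contracts towards the softmax solution whenever `eta * lam` lies in (0, 1], and with `eta = 1 / lam` a single step from the uniform start already reaches it. For regularisers without a closed-form solution, the same loop would apply with `grad` replaced by the gradient of the chosen objective.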

Paper Structure (28 sections, 82 equations, 1 figure, 3 tables, 1 algorithm)