Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

Arash Gholami Davoodi; Navid Rezazadeh; Seyed Pouyan Mousavi Davoudi; Pouya Pezeshkpour

TL;DR

Top-W introduces geometry-aware decoding for large language models by optimizing a Wasserstein-transport objective over the next-token crop, balancing semantic transport distance, cropped entropy, and retained mass within a fixed-potential framework. The method yields an exact factorization that reduces the $S$-step to a prefix or singleton search, enabling an efficient $O(n)$ crop update with an anchored Lipschitz potential and a practical top_m candidate pool. An alternating decoder iterates $f$- and $S$-steps with a geometry-driven bias anchored to the current crop, and theoretical and empirical analysis connects Top-W to Top-$k$ and Top-$H$ under special metrics. Empirically, Top-W achieves robust improvements across reasoning, instruction-following, and creative writing benchmarks with modest runtime overhead, highlighting the value of embedding geometry in decoding and its potential to enhance both accuracy and creativity.
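The $O(n)$ crop update mentioned above relies on the fact that the retained mass and cropped entropy of every prefix can be maintained with running sums. The following is a minimal sketch of that bookkeeping only, not the paper's S-step: the function name `prefix_crop_stats` is hypothetical, and sorting by descending probability is an assumption made for illustration, since the ordering key used by the actual method is not stated in this summary.

```python
import math
from typing import List, Tuple


def prefix_crop_stats(probs: List[float]) -> List[Tuple[int, float, float]]:
    """Return (prefix_size, retained_mass, cropped_entropy) for every prefix.

    Tokens are ordered by descending probability (an illustrative assumption).
    After the sort, each prefix costs O(1) thanks to two running sums.
    """
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    stats = []
    mass = 0.0   # running sum of p_i over the prefix
    plogp = 0.0  # running sum of p_i * log(p_i) over the prefix
    for k, i in enumerate(order, start=1):
        p = probs[i]
        if p <= 0.0:
            break  # zero-probability tokens cannot enter a crop
        mass += p
        plogp += p * math.log(p)
        # Entropy of the renormalized prefix q_i = p_i / mass:
        #   H(q) = log(mass) - (1 / mass) * sum_i p_i * log(p_i)
        entropy = math.log(mass) - plogp / mass
        stats.append((k, mass, entropy))
    return stats
```

A caller could then score each `(mass, entropy)` pair, together with a transport term, under whatever trade-off it prefers and keep the best prefix; that scoring rule is deliberately left out here because the summary does not specify it.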

Abstract

Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring the semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance, defined over token-embedding geometry, to keep the cropped distribution close to the original while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to a 33.7% improvement. Moreover, Top-W not only improves accuracy-focused performance but also boosts creativity under judge-based open-ended evaluation.
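To make the "nearest-set" potential concrete, here is a small sketch under stated assumptions: the ground metric is taken to be the Euclidean distance between token embeddings, and the names `nearest_set_potential` and `kr_lower_bound` are hypothetical rather than taken from the paper. It uses two standard facts: the distance to a set is 1-Lipschitz, and by Kantorovich-Rubinstein duality any 1-Lipschitz potential $f$ gives a lower bound $\sum_i f(i)\,(p_i - q_i) \le W_1(p, q)$.

```python
import math
from typing import List, Sequence, Set


def nearest_set_potential(emb: Sequence[Sequence[float]], kept: Set[int]) -> List[float]:
    """f(i) = min_{j in kept} ||e_i - e_j||: zero on the kept set and 1-Lipschitz."""
    if not kept:
        raise ValueError("kept set must be non-empty")
    return [min(math.dist(emb[i], emb[j]) for j in kept) for i in range(len(emb))]


def kr_lower_bound(probs: Sequence[float],
                   emb: Sequence[Sequence[float]],
                   kept: Set[int]) -> float:
    """Lower bound on W_1(p, p(.|kept)) from the KR dual with the potential above."""
    f = nearest_set_potential(emb, kept)
    mass = sum(probs[j] for j in kept)  # retained probability mass
    # Cropped distribution: p restricted to the kept set and renormalized.
    crop = [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]
    return sum(f[i] * (probs[i] - crop[i]) for i in range(len(probs)))
```

Because the potential vanishes on the kept set, the bound reduces to the dropped tokens' probabilities weighted by their distance to the nearest kept token, so dropping a token that sits far from everything retained is penalized more than dropping a near-duplicate.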

Paper Structure (62 sections, 10 theorems, 71 equations, 3 figures, 8 tables, 1 algorithm)

Introduction
Preliminaries
Embedding-induced geometry.
Embedding-induced ground metric.
Model distribution, cropping, and entropy.
Kantorovich--Rubinstein (KR) dual objects.
Top-$W$ Decoding
Wasserstein--Entropy--Mass Objective and an Exact Factorization
Dual Surrogate and an Exact $S$-Step for Fixed Potential
Proof idea (main body).
Computing Potentials and the Implemented Alternating Decoder
A Simple Geometry-Anchored Feasible Potential
Alternating Decoder: Exact $S$-step Inside a Practical Loop
Candidate pool (top_m).
Uniform-metric reductions
...and 47 more sections

Key Result

Lemma 3.1

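Let $S\subseteq V$ with $\Gamma_S\in(0,1)$. Then $$W_1\bigl(p,\, p(\cdot\mid S)\bigr) = (1-\Gamma_S)\, W_1\bigl(p(\cdot\mid S^c),\, p(\cdot\mid S)\bigr),$$ where $p(\cdot\mid S)$ and $p(\cdot\mid S^c)$ are the conditional distributions of $p$ restricted to $S$ and $S^c$, respectively.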
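A factorization of this kind follows in a few lines from Kantorovich-Rubinstein duality. The sketch below assumes that $\Gamma_S=\sum_{i\in S}p_i$ is the retained mass and that $W_1$ is the 1-Wasserstein distance under the ground metric; the notation is illustrative and may differ from the paper's.

```latex
\begin{align*}
p &= \Gamma_S\, p(\cdot\mid S) + (1-\Gamma_S)\, p(\cdot\mid S^c), \\
p - p(\cdot\mid S) &= (1-\Gamma_S)\,\bigl(p(\cdot\mid S^c) - p(\cdot\mid S)\bigr), \\
W_1\bigl(p,\, p(\cdot\mid S)\bigr)
  &= \sup_{\|f\|_{\mathrm{Lip}}\le 1}\ \sum_{i\in V} f(i)\,\bigl(p_i - p(i\mid S)\bigr) \\
  &= (1-\Gamma_S)\ \sup_{\|f\|_{\mathrm{Lip}}\le 1}\ \sum_{i\in V} f(i)\,\bigl(p(i\mid S^c) - p(i\mid S)\bigr) \\
  &= (1-\Gamma_S)\, W_1\bigl(p(\cdot\mid S^c),\, p(\cdot\mid S)\bigr).
\end{align*}
```

The key point is that the dual objective is linear in the difference of the two measures, so the scalar $(1-\Gamma_S)$ passes through the supremum.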

Figures (3)

Figure 1: Alpaca accuracy across temperatures for Min-$p$, Top-$p$, Top-$H$, and Top-$W$ (aggregated over 4 runs). Of the 9 $(T,\text{model})$ settings, Min-$p$, Top-$p$, Top-$H$, and Top-$W$ (our method) win 0, 0, 1, and 8, respectively.
Figure 2: MT-Bench judge scores across temperatures for Min-$p$, Top-$p$, Top-$H$, and Top-$W$ (aggregated over 4 runs). Of the 9 $(T,\text{model})$ settings, Min-$p$, Top-$p$, Top-$H$, and Top-$W$ (our method) win 0, 1, 2, and 6, respectively.
Figure 3: Sensitivity of Top-$W$'s GSM8K accuracy to $\beta$ for fixed values of $\lambda$, using LLaMA3.1-8B-Instruct at $T\in\{1.0,1.5,2.0\}$.

Theorems & Definitions (18)

Lemma 3.1: Exact factorization
Lemma 3.2: Fixed-$f$ $S$-step as normalized score maximization
Remark 3.3
Theorem 3.4: Exact fixed-$f$ $S$-step: prefix regime vs. singleton regime
Corollary 3.5: Monotonicity in $\beta$ (fixed $f$, prefix regime)
Lemma 4.1: Shift invariance of the fixed-$f$ $S$-step
Lemma 4.2: Extremal anchored Lipschitz envelopes
proof
proof
proof
...and 8 more
