Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

Arash Gholami Davoodi, Navid Rezazadeh, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour

TL;DR

Top-W introduces geometry-aware decoding for large language models by optimizing a Wasserstein-transport objective over the next-token crop, balancing semantic transport distance, cropped entropy, and retained mass within a fixed-potential framework. The method yields an exact factorization that reduces the S-step to a prefix or singleton search, enabling an efficient O(n) crop update with an anchored Lipschitz potential and a practical top_m candidate pool. An alternating decoder iterates f- and S-steps with a geometry-driven bias anchored to the current crop, and theoretical and empirical analysis shows connections to Top-$k$ and Top-$H$ under special metrics. Empirically, Top-W achieves robust improvements across reasoning, instruction-following, and creative writing benchmarks with modest runtime overhead, highlighting the value of embedding geometry in decoding and its potential to enhance both accuracy and creativity.

Abstract

Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic: they rely mainly on probability mass and entropy and ignore the semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance (defined over token-embedding geometry) to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance but also boosts creativity under judge-based open-ended evaluation.
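
To make the "one-dimensional prefix found with a linear scan" idea concrete, here is a minimal Python sketch. It is not the paper's implementation. The objective is an illustrative stand-in, assumed here to be a fixed-potential transport surrogate $\mathbb{E}_p[f]-\mathbb{E}_{p(\cdot\mid S)}[f]$ plus `lam` times the entropy of the renormalized crop plus `beta` times the discarded mass. The function name `prefix_scan_crop`, the parameters `f`, `lam`, and `beta`, and the choice to order tokens by descending probability are all hypothetical. The point is only that running sums let every prefix be scored in constant time.

```python
import math

def prefix_scan_crop(p, f, lam=0.1, beta=1.0):
    """Return token indices of the best prefix under an ILLUSTRATIVE objective.

    p    : next-token probabilities (sum to 1)
    f    : fixed potential value per token (hypothetical input)
    lam  : weight on cropped entropy (assumed form)
    beta : weight on discarded mass 1 - Gamma_S (assumed form)
    """
    n = len(p)
    # Assumption: candidates are ordered by descending probability.
    order = sorted(range(n), key=lambda i: -p[i])
    total_pf = sum(pi * fi for pi, fi in zip(p, f))  # E_p[f]

    mass = 0.0       # running Gamma_S
    sum_pf = 0.0     # running sum of p_i * f_i over the prefix
    sum_plogp = 0.0  # running sum of p_i * log p_i over the prefix
    best_cost, best_k = math.inf, 0

    for k, i in enumerate(order, start=1):
        if p[i] <= 0.0:
            break  # zero-probability tokens cannot be sampled
        mass += p[i]
        sum_pf += p[i] * f[i]
        sum_plogp += p[i] * math.log(p[i])
        # Entropy of the renormalized crop: log(Gamma) - (1/Gamma) * sum p log p
        entropy = math.log(mass) - sum_plogp / mass
        # Fixed-potential surrogate: E_p[f] - E_{p(.|S)}[f]
        transport = total_pf - sum_pf / mass
        cost = transport + lam * entropy + beta * (1.0 - mass)
        if cost < best_cost:
            best_cost, best_k = cost, k
    return order[:best_k]
```

After the initial sort, the loop touches each candidate once, so the scan itself is linear in the pool size. In this sketch, a large `lam` with `beta=0` and a zero potential drives the crop toward a single token, while `lam=0` with `beta>0` keeps the full support.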

Paper Structure

This paper contains 62 sections, 10 theorems, 71 equations, 3 figures, 8 tables, and 1 algorithm.

Key Result

Lemma 3.1

Let $S\subseteq V$ with $\Gamma_S\in(0,1)$. Then where $p(\cdot\mid S)$ and $p(\cdot\mid S^c)$ are the conditional distributions of $p$ restricted to $S$ and $S^c$, respectively.

Figures (3)

  • Figure 1: Alpaca accuracy across temperatures for Min-$p$, Top-$p$, Top-$H$, and Top-$W$ (aggregated over 4 runs). Across the 9 $(T,\text{model})$ settings shown in the bar plot, Min-$p$, Top-$p$, Top-$H$, and Top-$W$ (our method) win in 0, 0, 1, and 8 settings, respectively.
  • Figure 2: MT-Bench judge scores across temperatures for Min-$p$, Top-$p$, Top-$H$, and Top-$W$ (aggregated over 4 runs). Across the 9 $(T,\text{model})$ settings shown in the bar plot, Min-$p$, Top-$p$, Top-$H$, and Top-$W$ (our method) win in 0, 1, 2, and 6 settings, respectively.
  • Figure 3: Sensitivity of Top-$W$ GSM8K accuracy to $\beta$ for fixed values of $\lambda$, using LLaMA3.1-8B-Instruct at $T\in\{1.0,1.5,2.0\}$.

Theorems & Definitions (18)

  • Lemma 3.1: Exact factorization
  • Lemma 3.2: Fixed-$f$ $S$-step as normalized score maximization
  • Remark 3.3
  • Theorem 3.4: Exact fixed-$f$ $S$-step: prefix regime vs. singleton regime
  • Corollary 3.5: Monotonicity in $\beta$ (fixed $f$, prefix regime)
  • Lemma 4.1: Shift invariance of the fixed-$f$ $S$-step
  • Lemma 4.2: Extremal anchored Lipschitz envelopes
  • proof
  • proof
  • proof
  • ...and 8 more