The End of Manual Decoding: Towards Truly End-to-End Language Models

Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang

TL;DR

The paper challenges the assumption that language-model generation is truly end-to-end, highlighting how static, hand-tuned decoding hyperparameters limit performance. It introduces AutoDeco, a lightweight extension that adds per-token prediction heads for the temperature $\hat{T}_t$ and top-p $\hat{P}_t$ and uses a differentiable soft top-p to produce a final distribution $\tilde{\mathbf{p}}$ within a single forward pass, at near-zero additional latency. Across eight benchmarks and multiple model families, AutoDeco consistently outperforms default decoding and matches the performance of oracle-tuned static configurations, while exhibiting an emergent ability to interpret natural-language commands that steer decoding. The work also demonstrates a practical, drop-in deployment path with minimal overhead and points toward steerable, interactive decoding in which user intent is translated into internal sampling parameters. Overall, AutoDeco advances truly end-to-end generation and suggests a scalable route to dynamic, user-driven control of LLM outputs.

Abstract

The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
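
To make the idea of per-token decoding parameters concrete, here is a minimal sketch of a single generation step. It is not the authors' implementation: the head design (one linear layer followed by a sigmoid), the `max_temperature` cap, and all function and parameter names are assumptions made for illustration. The sketch applies ordinary temperature scaling and hard top-p filtering using values predicted from the hidden state rather than fixed in advance.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_decoding_params(hidden, temp_weights, top_p_weights, max_temperature=2.0):
    """Hypothetical heads: one linear layer each, squashed into a valid range."""
    temp_score = sum(h * w for h, w in zip(hidden, temp_weights))
    top_p_score = sum(h * w for h, w in zip(hidden, top_p_weights))
    # Temperature lies below max_temperature and is clamped to at least 1e-3.
    temperature = max(1e-3, max_temperature * sigmoid(temp_score))
    top_p = sigmoid(top_p_score)  # strictly between 0 and 1
    return temperature, top_p


def filtered_distribution(logits, temperature, top_p):
    """Temperature-scaled softmax followed by standard (hard) top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [0.0] * len(probs)
    mass_before = 0.0
    for i in order:
        if mass_before >= top_p:
            break
        kept[i] = probs[i]
        mass_before += probs[i]
    norm = sum(kept)
    return [k / norm for k in kept]


def decode_step(hidden, logits, temp_weights, top_p_weights, rng):
    """One step: predict (temperature, top_p) from the hidden state, then sample."""
    temperature, top_p = predict_decoding_params(hidden, temp_weights, top_p_weights)
    dist = filtered_distribution(logits, temperature, top_p)
    token = rng.choices(range(len(dist)), weights=dist, k=1)[0]
    return token, temperature, top_p
```

Calling `decode_step` once per generated token, with the current hidden state and logits, gives each position its own temperature and top-p, in contrast to a static configuration that reuses a single pair for the whole sequence.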

Paper Structure

This paper contains 31 sections, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: An overview of our proposed end-to-end decoding architecture compared to manual decoding. Our method dynamically predicts temperature and top-p values from the model's hidden states for each generation step. In contrast, manual decoding (bottom) relies on a single set of static, predefined hyperparameters for the entire sequence generation.
  • Figure 2: Comparison of the differentiable soft top-p sampling (decay steepness $\alpha=30$) with the standard hard-cutoff method. (a) illustrates the standard hard-cutoff mask, which has a non-differentiable step, against our proposed smooth and differentiable soft mask. (b) shows the effect of applying both masks to an example original probability distribution, where the soft mask method produces a differentiable probability distribution suitable for "end-to-end" training.
  • Figure 3: Comparison with expert-guided tuning, using a search interval of 0.1. Temperature is tuned first with top-p fixed at 1.0; top-p is then searched with temperature held at its best-performing value. AutoDeco achieves competitive performance without requiring any prior empirical tuning or domain-specific expert knowledge.
  • Figure 4: Ablation study on AutoDeco architecture designs. Joint optimization achieves the highest AIME Score.
  • Figure 5: An Emergent Phenomenon. This figure shows the token-level $\hat{T}/\hat{P}$ predictions for the same prompt under three conditions, observed without any targeted training. (Left) Baseline: The model's default dynamic $\hat{T}/\hat{P}$ values. (Middle) High-Diversity Command: The model spontaneously elevates its $\hat{T}/\hat{P}$ predictions. (Right) Low-Diversity Command: The model spontaneously suppresses its $\hat{T}/\hat{P}$ predictions.
  • ...and 1 more figure
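
The contrast drawn in the Figure 2 caption, a hard cutoff versus a smooth mask, can be illustrated with a short sketch. The caption gives only the decay steepness ($\alpha=30$), so the exponential form used here, $\exp(-\alpha \cdot \max(0, c - \hat{P}))$ applied to the cumulative probability $c$ of the sorted tokens, is an assumption, as are the function names. The hard variant is ordinary nucleus filtering, included for comparison.

```python
import math


def hard_top_p(probs, top_p):
    """Standard nucleus filtering: a step-function mask over sorted tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [0.0] * len(probs)
    mass_before = 0.0
    for i in order:
        if mass_before >= top_p:
            break  # every remaining token is zeroed out
        kept[i] = probs[i]
        mass_before += probs[i]
    norm = sum(kept)
    return [k / norm for k in kept]


def soft_top_p(probs, top_p, alpha=30.0):
    """Assumed soft mask: weight 1 up to top_p, exponential decay beyond it."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    masked = [0.0] * len(probs)
    cumulative = 0.0
    for i in order:
        cumulative += probs[i]
        # The mask is smooth in top_p, so gradients could flow through it
        # in an autodiff framework; plain floats are used here for clarity.
        mask = math.exp(-alpha * max(0.0, cumulative - top_p))
        masked[i] = probs[i] * mask
    norm = sum(masked)
    return [m / norm for m in masked]


example = [0.5, 0.3, 0.15, 0.05]
hard = hard_top_p(example, top_p=0.7)
soft = soft_top_p(example, top_p=0.7, alpha=30.0)
```

In this example the hard mask assigns exactly zero probability to the two least likely tokens, while the soft mask leaves them small but nonzero. A larger `alpha` moves the soft mask closer to a step function.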