The End of Manual Decoding: Towards Truly End-to-End Language Models

Zhichao Wang; Dongyang Ma; Xinting Huang; Deng Cai; Tian Lan; Jiahao Xu; Haitao Mi; Xiaoying Tang; Yan Wang

TL;DR

The paper challenges the view that language-model generation is truly end-to-end, pointing out that static, hand-tuned decoding hyperparameters limit performance. It introduces AutoDeco, a lightweight extension that adds per-token prediction heads for temperature $\hat{T}_t$ and top-p $\hat{P}_t$ and uses a differentiable soft top-p to produce the final distribution $\tilde{\mathbf{p}}$ within a single forward pass, with near-zero additional latency. Across eight benchmarks and multiple model families, AutoDeco consistently outperforms default decoding and matches the performance of oracle-tuned static configurations. It also exhibits an emergent ability to interpret natural-language commands that steer decoding. The work describes a practical, drop-in deployment path with minimal overhead and points toward steerable, interactive decoding in which user intent is translated into internal sampling parameters.

Abstract

The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
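
The per-token decoding step described in the abstract can be illustrated with a minimal sketch. It assumes the temperature and top-p values for the current step have already been produced by the prediction heads, and it uses an exponential decay mask as one possible smooth stand-in for hard top-p truncation. The function names, the decay constant `alpha`, and the exact mask form are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_top_p(probs, top_p, alpha=30.0):
    """Smoothly down-weight tokens outside the top-p nucleus.

    Assumption: a token's weight decays as exp(-alpha * excess), where
    excess is how far the probability mass ranked ahead of it exceeds top_p.
    A hard top-p cut would set those weights to exactly zero instead.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    masked = [0.0] * len(probs)
    mass_before = 0.0  # cumulative mass of higher-ranked tokens
    for i in order:
        excess = max(0.0, mass_before - top_p)
        masked[i] = probs[i] * math.exp(-alpha * excess)
        mass_before += probs[i]
    total = sum(masked)
    return [m / total for m in masked]

def decode_step(logits, pred_temperature, pred_top_p):
    """Combine both predicted parameters into one sampling distribution."""
    return soft_top_p(softmax(logits, pred_temperature), pred_top_p)

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
sharp = decode_step(logits, pred_temperature=0.5, pred_top_p=0.9)
flat = decode_step(logits, pred_temperature=1.5, pred_top_p=0.9)
```

Because the mask is a smooth function of the predicted top-p rather than a hard cutoff, gradients can in principle flow back to the head that predicted it, which is the property the "differentiable soft top-p" in the TL;DR refers to. The top-ranked token always has zero excess here, so it is never suppressed.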
