Text Generation Beyond Discrete Token Sampling
Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao
TL;DR
Autoregressive generation typically discards the full next-token distribution after sampling, which may limit multi-step reasoning. Mixture of Inputs (MoI) is a training-free approach that preserves this distributional information by forming a posterior mixture of token embeddings. After computing the next-token distribution $\boldsymbol{p}_t$ and sampling a token $y_t$ (with one-hot indicator $y_{t,i}$), it derives mixing weights $\boldsymbol{w}_t$ from a Dirichlet–Multinomial model with prior parameters $\alpha_i = H p_{t,i}$, where $H = H(\boldsymbol{p}_t)$, pseudo-counts $c_i = (\beta+1-H) y_{t,i}$, and posterior mean $w_{t,i} = \frac{H p_{t,i} + (\beta+1-H) y_{t,i}}{\beta+1}$. The mixed embedding $\boldsymbol{h}_t = \sum_i w_{t,i} \boldsymbol{e}_i$ replaces the standard one-hot embedding as input to the next step, so the model conditions on both the sampled token and the distribution it was drawn from, without retraining. Evaluated on AIME, Count Down 4, GPQA-Diamond, and LiveCodeBench across four open-source LLMs (QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, DAPO-Qwen-32B), MoI yields consistent gains, with an average absolute improvement of about $1.8\%$ and larger gains on symbolic and multi-step tasks; the gains hold up under McNemar's test and across 64 random seeds. MoI incurs negligible overhead and requires no architectural changes, suggesting a practical path to distribution-aware decoding that improves the reasoning and code-generation performance of existing models.
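The posterior mean above follows from standard Dirichlet conjugacy. As a short worked step, assuming $\boldsymbol{p}_t$ sums to one and $y_{t,i}$ is the one-hot indicator of the sampled token (so it also sums to one), the denominator collapses to $\beta+1$:

$$
\begin{aligned}
\boldsymbol{w}_t \mid y_t &\sim \mathrm{Dir}(\boldsymbol{\alpha} + \boldsymbol{c}), \\
\sum_j \alpha_j &= H \sum_j p_{t,j} = H, \qquad \sum_j c_j = (\beta+1-H) \sum_j y_{t,j} = \beta+1-H, \\
\mathbb{E}[w_{t,i} \mid y_t] &= \frac{\alpha_i + c_i}{\sum_j (\alpha_j + c_j)} = \frac{H p_{t,i} + (\beta+1-H) y_{t,i}}{\beta+1}.
\end{aligned}
$$

Under these assumptions the weights $w_{t,i}$ sum to one, so $\boldsymbol{h}_t$ is a convex combination of token embeddings whenever $0 \le H \le \beta+1$.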
Abstract
In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior and the sampled token as the observation, and we replace the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models, including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
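The blending step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `moi_weights` and `mixed_embedding`, the default `beta=1.0`, and the choice to normalize the entropy by the log of the vocabulary size (so that $H$ lies in $[0, 1]$) are all assumptions made for illustration.

```python
import numpy as np


def moi_weights(p, sampled_id, beta=1.0):
    """Posterior-mean mixing weights for one decoding step (illustrative sketch).

    p          : next-token probabilities, shape (vocab,), summing to 1, vocab > 1
    sampled_id : index of the token that was actually sampled
    beta       : non-negative hyperparameter (default value is an assumption)
    """
    p = np.asarray(p, dtype=np.float64)
    vocab = p.shape[0]

    # Assumption: H is the entropy of p normalized by log(vocab), so 0 <= H <= 1.
    nonzero = p[p > 0]
    H = float(-(nonzero * np.log(nonzero)).sum() / np.log(vocab))

    # One-hot observation of the sampled token.
    y = np.zeros(vocab)
    y[sampled_id] = 1.0

    # Posterior mean: (H * p + (beta + 1 - H) * y) / (beta + 1).
    return (H * p + (beta + 1.0 - H) * y) / (beta + 1.0)


def mixed_embedding(w, E):
    """Weighted sum of embedding rows: h = sum_i w_i * e_i, with E of shape (vocab, dim)."""
    return np.asarray(w) @ np.asarray(E)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(4, 3))            # toy embedding table: 4 tokens, 3 dims
    p = np.array([0.7, 0.2, 0.05, 0.05])   # toy next-token distribution
    w = moi_weights(p, sampled_id=0)
    h = mixed_embedding(w, E)
    print("weights:", w, "sum:", w.sum())
    print("mixed embedding shape:", h.shape)
```

In this sketch a confident (low-entropy) distribution yields weights close to the one-hot vector of the sampled token, while a high-entropy distribution shifts more weight onto the other candidate tokens.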
