Text Generation Beyond Discrete Token Sampling

Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao

TL;DR

Autoregressive generation typically discards the full next-token distribution, potentially limiting multi-step reasoning. MoI presents a training-free approach that preserves distributional information by forming a posterior mixture of token embeddings: after computing the next-token distribution $\boldsymbol{p}_t$ and sampling $y_t$, it derives mixing weights $\boldsymbol{w}_t$ from a Dirichlet–Multinomial model with $\alpha_i = H(\boldsymbol{p}_t) p_{t,i}$, pseudo-counts $c_i=(\beta+1-H) y_{t,i}$, and posterior mean $w_{t,i} = \frac{H p_{t,i} + (\beta+1-H) y_{t,i}}{\beta+1}$. The mixed embedding $\boldsymbol{h}_t = \sum_i w_{t,i} \boldsymbol{e}_i$ replaces the standard one-hot embedding for the next step, enabling an internal, probabilistic discourse without retraining. Evaluated on AIME, Count Down 4, GPQA-Diamond, and LiveCodeBench across four open-source LLMs (QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, DAPO-Qwen-32B), MoI yields consistent gains with an average absolute improvement of about $1.8\%$ and larger gains on symbolic and multi-step tasks; results remain robust under McNemar's test and across 64 random seeds. Importantly, MoI incurs negligible overhead and no architectural changes, suggesting a practical path to distribution-aware decoding that enhances reasoning and code-generation capabilities in existing models.
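
The mixing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`normalized_entropy`, `moi_weights`, `moi_embedding`) are hypothetical, and the sketch assumes that $H(\boldsymbol{p}_t)$ is the Shannon entropy normalized by $\log V$ so that it lies in $[0, 1]$, which the summary does not state explicitly.

```python
import numpy as np


def normalized_entropy(p: np.ndarray) -> float:
    """Shannon entropy of p divided by log(V), so the value lies in [0, 1].

    Assumption: the summary only writes H(p_t); the normalization is ours.
    Requires a vocabulary of size V > 1.
    """
    p = np.asarray(p, dtype=np.float64)
    nonzero = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(p.size))


def moi_weights(p, sampled_id: int, beta: float = 1.0) -> np.ndarray:
    """Posterior-mean weights w_i = (H p_i + (beta + 1 - H) y_i) / (beta + 1)."""
    p = np.asarray(p, dtype=np.float64)
    h = normalized_entropy(p)
    y = np.zeros_like(p)
    y[sampled_id] = 1.0  # one-hot vector for the sampled token
    return (h * p + (beta + 1.0 - h) * y) / (beta + 1.0)


def moi_embedding(p, sampled_id: int, embeddings: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Mixed input h_t = sum_i w_i e_i, where embeddings has shape (V, d)."""
    return moi_weights(p, sampled_id, beta) @ embeddings


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(4, 8))           # toy embedding table: V = 4, d = 8
    p_t = np.array([0.5, 0.3, 0.15, 0.05])
    y_t = int(rng.choice(p_t.size, p=p_t))  # standard discrete sampling step
    h_t = moi_embedding(p_t, y_t, E, beta=1.0)
    print(moi_weights(p_t, y_t).sum(), h_t.shape)
```

Under this assumption the weights always sum to one. When the distribution is fully peaked ($H = 0$) the mixed input reduces to the ordinary one-hot embedding, and a larger $\beta$ shifts weight from the distribution toward the sampled token.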

Abstract

In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
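
The Bayesian step can be made explicit with a short worked derivation. It uses only the quantities given in the TL;DR (the prior concentration $\alpha_i = H p_{t,i}$ and the pseudo-counts $c_i = (\beta + 1 - H)\, y_{t,i}$) together with standard Dirichlet–Multinomial conjugacy. It assumes $H \in [0, 1]$ and $\beta \geq 0$, so that the pseudo-counts are non-negative.

$$
\begin{aligned}
\boldsymbol{\theta} &\sim \mathrm{Dir}(\boldsymbol{\alpha}), \qquad \alpha_i = H\, p_{t,i}, \qquad \textstyle\sum_i \alpha_i = H, \\
c_i &= (\beta + 1 - H)\, y_{t,i}, \qquad \textstyle\sum_i c_i = \beta + 1 - H, \\
\boldsymbol{\theta} \mid \boldsymbol{c} &\sim \mathrm{Dir}(\boldsymbol{\alpha} + \boldsymbol{c}), \\
w_{t,i} = \mathbb{E}[\theta_i \mid \boldsymbol{c}] &= \frac{\alpha_i + c_i}{\sum_j (\alpha_j + c_j)} = \frac{H\, p_{t,i} + (\beta + 1 - H)\, y_{t,i}}{\beta + 1}.
\end{aligned}
$$

Two limits follow directly: for $H = 0$ the weights collapse to the one-hot vector $\boldsymbol{y}_t$, and for $H = 1$ they become $(\boldsymbol{p}_t + \beta\, \boldsymbol{y}_t) / (\beta + 1)$.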

Paper Structure

This paper contains 46 sections, 6 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of the regular autoregressive generation pipeline (left) and our proposed Mixture of Inputs (MoI) strategy (right). In regular generation, only the discrete sampled token is fed back at each step, whereas MoI preserves the full sampling distribution by computing a blended embedding $h = \sum_i w_i e_i$, with weights $w_i$ interpolating embeddings $\{e_i\}_{i=1}^V$, letting the model consider several plausible tokens simultaneously within a single forward pass.
  • Figure 2: Hyperparameter Importance Analysis. Comparison of three key hyperparameters ($\beta$ in MoI, top-p, and temperature) across four LLMs on two mathematical reasoning tasks. Left: Expected performance gain (%) when optimizing each hyperparameter individually through best-of-N-shots tuning. The graph shows $\beta$ consistently outperforms other parameters as N increases. Right: Relative feature importance derived from random forest regression analysis, confirming $\beta$'s strong influence (0.41) on model performance compared to top-p (0.32) and temperature (0.27). These results demonstrate that $\beta$ is highly influential for effectively controlling input mixing during chain-of-thought reasoning.
  • Figure 3: Task-dependent Optimal Mixing Strategies. The plot shows accuracy deviation from the task mean across different $\beta$ values for AIME (reasoning-heavy) and Count Down 4 (enumeration-heavy), averaged across four LLMs. Lower $\beta$ values ($\beta \leq 1$) significantly benefit AIME, while higher $\beta$ values ($\beta > 1$) improve Count Down 4. This divergence shows that MoI's impact depends on task characteristics: reasoning-intensive tasks perform better with stronger distribution mixing (low $\beta$), which leaves room for more creative exploration, while enumeration-intensive tasks benefit from weaker distribution mixing (high $\beta$), which keeps the input closer to the sampled token and helps explore the combinatorial search space with more focus.
  • Figure A1: Comparison of the distributions of evaluation results at the best top-$p$ and temperature hyperparameters, for the baseline and with MoI. The results indicate a strong performance gain from incorporating the sampling distribution into the generation process.
  • Figure A2: Comparison of the distributions of evaluation results across all top-$p$ and temperature hyperparameters. The results indicate a nearly universal performance gain across average hyperparameter settings.