
EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models

Atula Tejaswi, Litu Rout, Constantine Caramanis, Sanjay Shakkottai, Sujay Sanghavi

TL;DR

EntRGi tackles the challenge of steering discrete diffusion language models at inference time with differentiable reward signals by introducing entropy-aware gradient guidance. It interpolates between continuous relaxations and hard token embeddings using tokenwise entropy, via $\hat{e}^l = (1-w^l)\bar{e}^l + w^l \tilde{e}^l$ with $w^l = H(q^l)/\log K$, to provide reliable reward gradients $\nabla_{\psi^l} R(\hat{e})$ while keeping the diffusion and reward models fixed. The paper presents a theoretical analysis of gradient approximation errors, defining $\mathcal{E}^l$ and $\mathcal{D}^l$, and demonstrates that EntRGi reduces early-denoising approximation error relative to prior methods. Empirically, using a 7B Dream diffusion LLM and three reward models across three benchmarks, EntRGi achieves consistent improvements over state-of-the-art baselines, with larger reward models and a moderate number of gradient steps further enhancing performance. This work demonstrates that entropy-based modulation offers a principled, training-free path to reliable reward-guided generation in discrete diffusion models, with practical implications for controllable text generation without additional fine-tuning.
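The interpolation above can be sketched for a single masked position as follows. This is a minimal illustration, not the authors' implementation: it assumes that $\bar{e}^l$ is the probability-weighted mean of the token embeddings and that $\tilde{e}^l$ is the embedding of one sampled hard token. The function names `entropy_weight` and `entrgi_input` are hypothetical, and the straight-through gradient path is omitted because the sketch uses no autograd.

```python
import math


def softmax(logits):
    """Convert logits to a probability vector q (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def entropy_weight(q):
    """Normalized entropy w = H(q) / log K, clamped to [0, 1] against rounding."""
    K = len(q)
    H = -sum(p * math.log(p) for p in q if p > 0.0)
    return min(1.0, max(0.0, H / math.log(K)))


def entrgi_input(q, embeddings, hard_index):
    """Return (w, e_hat) with e_hat = (1 - w) * e_bar + w * e_tilde.

    q          : probabilities over K tokens at one masked position
    embeddings : K embedding vectors (lists of floats)
    hard_index : index of the sampled hard token (assumed given)
    """
    dim = len(embeddings[0])
    # e_bar: continuous relaxation, assumed here to be the expected embedding under q.
    e_bar = [sum(q[k] * embeddings[k][d] for k in range(len(q))) for d in range(dim)]
    # e_tilde: embedding of the sampled hard token.
    e_tilde = embeddings[hard_index]
    w = entropy_weight(q)
    e_hat = [(1.0 - w) * a + w * b for a, b in zip(e_bar, e_tilde)]
    return w, e_hat
```

With a uniform $q^l$ the weight is $w^l = 1$ and the reward model sees the hard token embedding; with a one-hot $q^l$ the weight is $w^l = 0$ and it sees the continuous relaxation, which in that case coincides with the embedding of the predicted token.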

Abstract

Reward guidance has been applied to great success in the test-time adaptation of continuous diffusion models; it updates each denoising step using the gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the natural outputs of the model because they are discrete tokens. Existing approaches either replace these discrete tokens with continuous relaxations, or employ techniques like the straight-through estimator. In this work, we show the downsides of both these methods. The former degrades gradient feedback because the reward model has never been trained with continuous inputs. The latter involves incorrect optimization because the gradient evaluated at discrete tokens is used to update continuous logits. Our key innovation is to go beyond this tradeoff by introducing a novel mechanism called EntRGi: Entropy aware Reward Guidance that dynamically regulates the gradients from the reward model. By modulating the continuous relaxation using the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model. We empirically validate our approach on a 7B-parameter diffusion language model across 3 diverse reward models and 3 multi-skill benchmarks, showing consistent improvements over state-of-the-art methods.

Paper Structure

This paper contains 18 sections, 8 equations, 15 figures, 7 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Overall pipeline of Entropy-aware Reward Guidance (EntRGi). In standard sampling methods [dream, llada], the current input $z_t$ is fed to the discrete diffusion LLM (dLLM), which produces output distributions at the masked positions; the most confident tokens are then committed to obtain $z_{t-1}$. Our method, EntRGi, instead modifies the logits at the masked positions using gradients from a reward model, while keeping both the dLLM and the reward model frozen. The embeddings provided to the reward model at masked positions are constructed as an entropy-weighted interpolation between a continuous relaxation of the token embeddings and sampled hard token embeddings. Lower entropy proportionally emphasizes the continuous relaxation, while higher entropy increases reliance on hard tokens via a straight-through estimator [ste, gumbel-soft, aps].
  • Figure 2: Average L2-norm between the soft embedding $\tilde{e}$ and the reward model input $\hat{e}$ as a function of decoding timestep, along with average entropy. The maximum possible entropy is $\log K \approx 11$. EntRGi reduces early-step approximation error compared to APS by upweighting the continuous relaxation on tokens with relatively low entropy in the predicted sequence.
  • Figure 3: Heatmaps showing the joint distribution of entropy and approximation error $\mathcal{E}^l$ for three benchmarks (RM-Bench, JudgeBench, Reward-Bench-2) using APS (top) and EntRGi (bottom). Color indicates frequency on a log scale. EntRGi upweights soft tokens based on entropy. For entropy in the range 1--4, the soft approximation $\bar{e}$ is heavily preferred, trading off $\mathcal{E}^l$ for $\mathcal{D}^l$ proportionally.
  • Figure 4: LMUnit score with increasing reward model size across Reward-Bench-2 [rewardbench2], RM-Bench [rmbench], and JudgeBench [judgebench], for $M=3$ and $\tau=0.7$. Increasing reward model size generally leads to improved performance. We observe similar trends for other metrics (Top@1, Avg@4), reported in \ref{sec:appdx_rm_size} in the Appendix.
  • Figure 5: Change in Top@1 accuracy and LMUnit score relative to $M=1$ as the number of reward model gradient steps $M$ increases for EntRGi. Results are averaged over 3 reward model sizes (0.6B, 1.7B, 4B). The optimal $M$ is dataset-dependent (our experiments use $M=3$ for all datasets). LMUnit collapses beyond $M=4$, indicating overoptimization. Raw scores are reported in \ref{sec:appdx_m_scaling} in the Appendix.
  • ...and 10 more figures
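
The standard sampling step described in the Figure 1 caption (commit the most confident tokens at masked positions to obtain $z_{t-1}$) can be sketched as below. This is an illustration under stated assumptions rather than the samplers used in the paper: confidence is taken to be the maximum token probability at a position, the number of tokens committed per step is a free parameter, and `commit_most_confident` and `mask_id` are hypothetical names.

```python
def commit_most_confident(tokens, probs, num_commit, mask_id=-1):
    """Fill the num_commit most confident masked positions with their argmax token.

    tokens     : list of token ids, with mask_id marking masked positions
    probs      : dict mapping each masked position to its probability list
    num_commit : how many masked positions to commit in this step (assumed)
    """
    scored = []
    for i, t in enumerate(tokens):
        if t != mask_id:
            continue
        # Confidence is assumed to be the probability of the most likely token.
        best = max(range(len(probs[i])), key=lambda k: probs[i][k])
        scored.append((probs[i][best], i, best))
    # Most confident positions first.
    scored.sort(key=lambda s: s[0], reverse=True)
    out = list(tokens)
    for _, i, best in scored[:num_commit]:
        out[i] = best
    return out
```

Under EntRGi, the same kind of commit step would operate on the distributions obtained after the reward gradients have modified the logits at the masked positions.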