MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models

Kevin Zhai, Utsav Singh, Anirudh Thatipelli, Souradip Chakraborty, Anit Kumar Sahu, Furong Huang, Amrit Singh Bedi, Mubarak Shah

TL;DR

This work addresses reward hacking in inference-time alignment for text-to-image diffusion models by introducing MIRA, a training-free method that directly regularizes the output distribution with a score-based image-space KL surrogate. The core objective is $\mathcal{L}_{\text{MIRA}}(z,c) = - r(x_0,c) + \beta\, d_{\text{KL}}[\, p_\theta(x_0|z,c) \,\|\, p_\theta(x_0|z_0,c) \,]$, which is made tractable via an upper bound computed from diffusion scores. To handle non-differentiable rewards, MIRA-DPO extends the framework using Direct Preference Optimization with a principled surrogate for the reward log-ratio, enabling alignment from preferences without requiring reward gradients. Empirical results on SDv1.5 and SDXL across multiple rewards and datasets show substantial gains in win rates and reduced distributional drift, with human studies indicating strong preference for MIRA. This approach offers a training-free, robust pathway to align diffusion-based image generation with complex, potentially black-box user objectives while preserving prompt fidelity.
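
The trade-off encoded in this objective can be illustrated with a toy sketch. The example below is not the MIRA estimator: it assumes a hypothetical linear "generator" $x = Wz$ and equal-variance Gaussian output distributions, for which the KL term has the closed form $\|x - x_{\text{ref}}\|^2 / (2\sigma^2)$, and it uses the mean pixel value as a stand-in reward. All names (`optimize_noise`, `W`, `sigma`, `x_ref`) are illustrative assumptions.

```python
import numpy as np

def optimize_noise(W, z_init, beta, sigma=1.0, lr=0.1, steps=200):
    """Gradient descent on L(z) = -r(Wz) + beta * KL with a toy closed-form KL.

    Assumptions (not from the paper): linear generator x = W z, reward
    r(x) = mean(x), and KL = ||x - x_ref||^2 / (2 sigma^2).
    """
    d = W.shape[0]
    z = z_init.copy()
    x_ref = W @ z_init                     # output produced by the unoptimized noise
    grad_reward = W.T @ (np.ones(d) / d)   # gradient of mean(W z) with respect to z
    for _ in range(steps):
        x = W @ z
        grad_kl = W.T @ (x - x_ref) / sigma**2   # gradient of the toy KL term
        z = z - lr * (-grad_reward + beta * grad_kl)
    x = W @ z
    reward = float(x.mean())
    drift = float(np.sum((x - x_ref) ** 2) / (2 * sigma**2))
    return reward, drift

W = np.array([[1.0, 0.5], [0.2, 1.0]])
z_init = np.array([0.3, -0.1])
for beta in (0.0, 1.0):
    reward, drift = optimize_noise(W, z_init, beta)
    print(f"beta={beta}: reward={reward:.3f}, drift={drift:.3f}")
```

In this toy setting, `beta=0` lets both reward and drift grow with the number of steps, while `beta>0` makes the iterate settle where the reward gradient balances the penalty.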

Abstract

Diffusion models excel at generating images conditioned on text prompts, but the resulting images often do not satisfy user-specific criteria measured by scalar rewards such as Aesthetic Scores. This alignment typically requires fine-tuning, which is computationally demanding. Recently, inference-time alignment via noise optimization has emerged as an efficient alternative, modifying initial input noise to steer the diffusion denoising process towards generating high-reward images. However, this approach suffers from reward hacking, where the model produces images that score highly, yet deviate significantly from the original prompt. We show that noise-space regularization is insufficient and that preventing reward hacking requires an explicit image-space constraint. To this end, we propose MIRA (MItigating Reward hAcking), a training-free, inference-time alignment method. MIRA introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory with a frozen backbone, constraining the output distribution so reward can increase without off-distribution drift (reward hacking). We derive a tractable approximation to KL using diffusion scores. Across SDv1.5 and SDXL, multiple rewards (Aesthetic, HPSv2, PickScore), and public datasets (e.g., Animal-Animal, HPDv2), MIRA achieves >60% win rate vs. strong baselines while preserving prompt adherence; mechanism plots show reward gains with near-zero drift, whereas DNO drifts as compute increases. We further introduce MIRA-DPO, mapping preference optimization to inference time with a frozen backbone, extending MIRA to non-differentiable rewards without fine-tuning.
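
For reference, the generic pairwise DPO loss that preference optimization is usually built on can be sketched as follows. This is the standard form, $-\log \sigma(\beta\,[\ell_w - \ell_l])$ with $\ell = \log p - \log p_{\text{ref}}$, and not the paper's inference-time surrogate, which this summary does not spell out. The function name and arguments are illustrative assumptions.

```python
import numpy as np

def dpo_loss(logratio_win, logratio_lose, beta=1.0):
    """Standard DPO loss: -log sigmoid(beta * (logratio_win - logratio_lose)).

    Each logratio is log p(sample) - log p_ref(sample) for the preferred
    (win) or rejected (lose) sample. Only pairwise preferences are needed,
    so no reward gradient is required.
    """
    margin = beta * (np.asarray(logratio_win) - np.asarray(logratio_lose))
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed stably with logaddexp
    return np.logaddexp(0.0, -margin)
```

The loss equals $\log 2$ when the two log-ratios are equal and decreases as the preferred sample's log-ratio exceeds the rejected one's.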

Paper Structure

This paper contains 22 sections, 1 theorem, 41 equations, 13 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Closeness in the noise space does not imply closeness in the diffusion-induced image distribution $p_\theta(\cdot | z,c)$.
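
The intuition behind this statement can be shown with a small numerical example. The sketch below uses a hypothetical deterministic map (`toy_generator`, a steep `tanh`) in place of a diffusion sampler, so it only illustrates the point for individual outputs and is not the paper's proof.

```python
import numpy as np

def toy_generator(z, sharpness=1e3):
    """Hypothetical stand-in for a sampler: a steep, smooth map from noise to 'pixels'."""
    return np.tanh(sharpness * np.asarray(z))

# Two noise vectors that are very close to each other in Euclidean distance.
z_a = np.array([0.01, -0.01])
z_b = -z_a

noise_dist = float(np.linalg.norm(z_a - z_b))
image_dist = float(np.linalg.norm(toy_generator(z_a) - toy_generator(z_b)))
print(f"noise distance={noise_dist:.4f}, image distance={image_dist:.4f}")
```

A penalty on `noise_dist` alone would barely register this change, even though the two outputs sit at opposite ends of the output range.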

Figures (13)

  • Figure 1: Sampling-based methods vs. noise optimization. (a) Best-of-$N$ draws many independent samples from the base model and picks the best. When the high-reward region of the image space (green) has low likelihood, most samples fall elsewhere, and Best-of-$N$ is inefficient. Noise optimization, in contrast, adjusts the initial noise sample, which can steer the diffusion trajectories toward the high-reward region without large $N$. But this additional flexibility (the ability to optimize for any reward) comes at the price of additional reward hackability. (b) Trade-off: Best-of-$N$ has lower hackability but also lower flexibility. The existing noise optimization method (DNO [tang2024tuning]) has higher flexibility but is prone to reward hacking, whereas our method, MIRA, maintains flexibility without introducing reward hackability.
  • Figure 1: MIRA vs. DNO in reward hacking. On the image brightness reward, we demonstrate that MIRA is able to effectively mitigate reward hacking and generate better, more realistic images while maintaining prompt fidelity when compared to the state-of-the-art baseline. In the top row, after 50 optimization steps, DNO completely hacks the brightness reward and generates an image that is overly white and unrealistic. In contrast, in the bottom row, MIRA is able to mitigate reward hacking and produce much better images, while aligning with the target reward.
  • Figure 2: Illustrating reward hacking in inference-time alignment of diffusion models. (a) Given the prompt "generate an image of a fly" and a preference for better aesthetics (Aesthetic Score [schuhmann2022laion]), we observe that the state-of-the-art method (Direct Noise Optimization) achieves high reward yet no longer follows the prompt, an example of reward hacking. In contrast, our images have better aesthetic quality and do not suffer from reward hacking; hence, our reward is lower. (b) Aesthetic Score and win-rate results (against base SDv1.5) on the Animal dataset [ddpo]. MIRA (bottom) significantly outperforms the SoTA (top) in average win rate, effectively mitigating reward hacking despite lower average rewards. We use GPT-4o [hurst2024gpt] as the win-rate judge.
  • Figure 2: Win rates (vs. SDv1.5) and average rewards across methods. We evaluate our method, MIRA, on Aesthetic Score (left), brightness (middle), and darkness (right), comparing against DDPO, Diffusion-DPO, D3PO, and DNO. MIRA consistently outperforms baselines in win rates despite lower average rewards. Notably, higher rewards can indicate overoptimization, resulting in lower win rates.
  • Figure 3: Tiny changes to the initial noise can yield markedly different images under the same prompt.
  • ...and 8 more figures
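
The inefficiency of Best-of-$N$ described in the first Figure 1 caption can be made concrete with a short calculation. If a single independent sample lands in the high-reward region with probability $p$ (an assumed quantity, not reported in the source), then at least one of $N$ samples does so with probability $1 - (1-p)^N$. The helper names below are illustrative.

```python
import math

def hit_probability(p, n):
    """Probability that at least one of n independent samples hits the region."""
    return 1.0 - (1.0 - p) ** n

def samples_needed(p, target=0.95):
    """Smallest N with hit_probability(p, N) >= target, for 0 < p < 1 and 0 < target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

for p in (0.1, 0.01, 0.001):
    n = samples_needed(p)
    print(f"p={p}: N={n}, hit probability={hit_probability(p, n):.4f}")
```

The required $N$ scales roughly as $1/p$, which is why sampling alone becomes expensive when the high-reward region has low likelihood under the base model.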

Theorems & Definitions (3)

  • Proposition 1
  • proof
  • proof