Projected Gradient Ascent for Efficient Reward-Guided Updates with One-Step Generative Models

Jisung Hwang, Minhyuk Sung

TL;DR

The paper tackles the practicality of reward-guided generation with one-step generative models by addressing reward hacking and inefficiency in test-time latent optimization. It replaces soft regularization with hard white Gaussian noise constraints enforced via a closed-form projection onto a carefully designed feasible set, leveraging a bijective mapping to a compact spectral domain to enable an $O(N \log N)$ projection per iteration. Empirical results on one-step text-to-image models show higher target rewards and preserved human-aligned quality across multiple reward models, with substantially reduced wall-clock time compared to regularization-based baselines. The method also clarifies connections to prior regularization approaches, arguing that the hard-constraint formulation yields tighter Gaussian statistics and more reliable optimization, thereby making reward-guided generation more practical at deployment. Overall, the work provides a principled, efficient framework for robust test-time optimization in high-dimensional latent spaces with potential for broad applicability, while acknowledging safety considerations in reward design and deployment.

Abstract

We propose a constrained latent optimization method for reward-guided generation that preserves white Gaussian noise characteristics with negligible overhead. Test-time latent optimization can unlock substantially better reward-guided generations from pretrained generative models, but it is prone to reward hacking that degrades quality and also too slow for practical use. In this work, we make test-time optimization both efficient and reliable by replacing soft regularization with hard white Gaussian noise constraints enforced via projected gradient ascent. Our method applies a closed-form projection after each update to keep the latent vector explicitly noise-like throughout optimization, preventing the drift that leads to unrealistic artifacts. This enforcement adds minimal cost: the projection matches the $O(N \log N)$ complexity of standard algorithms such as sorting or FFT and does not practically increase wall-clock time. In experiments, our approach reaches a comparable Aesthetic Score using only 30% of the wall-clock time required by the SOTA regularization-based method, while preventing reward hacking.
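
The update rule described in the abstract (a gradient ascent step followed by a closed-form projection) can be sketched as follows. This is only an illustration under stated assumptions: the feasible set here is a stand-in, namely the sphere of radius $\sqrt{N}$ around which white Gaussian vectors concentrate, not the paper's feasible set, and `toy_reward`, `toy_reward_grad`, and `project_to_sphere` are hypothetical names introduced for the example.

```python
import math
import random


def project_to_sphere(x, radius):
    """Closed-form Euclidean projection onto {v : ||v||_2 = radius}.

    Stand-in constraint for illustration; the paper's feasible set differs.
    """
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        # Every point on the sphere is equally close to the origin; pick one.
        return [radius] + [0.0] * (len(x) - 1)
    return [radius * v / norm for v in x]


def projected_gradient_ascent(x0, grad_fn, step_size, num_steps, radius):
    """Alternate a gradient ascent step with a projection onto the feasible set."""
    x = project_to_sphere(x0, radius)
    for _ in range(num_steps):
        g = grad_fn(x)
        x = [xi + step_size * gi for xi, gi in zip(x, g)]
        x = project_to_sphere(x, radius)
    return x


# Hypothetical toy reward r(x) = <w, x>; its gradient is the constant vector w.
N = 64
rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(N)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(N)]


def toy_reward(x):
    return sum(wi * xi for wi, xi in zip(w, x))


def toy_reward_grad(x):
    return list(w)


radius = math.sqrt(N)
x_start = project_to_sphere(x0, radius)
x_opt = projected_gradient_ascent(x0, toy_reward_grad, 0.1, 200, radius)
final_norm = math.sqrt(sum(v * v for v in x_opt))
```

In this sketch the iterate satisfies the constraint exactly after every step, which is the practical difference from a soft penalty: a penalty only discourages drift, while a projection rules it out. In a real setting `grad_fn` would backpropagate a reward model through the one-step generator.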

Paper Structure

This paper contains 33 sections, 5 theorems, 91 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Proposition 4.1

The mapping $\mathcal{F}$ is a bijection from $\mathbb{R}^N$ to $\mathbb{C}^{N/2}$. Moreover, if $\bm{z} \sim \mathcal{CN}(\bm{0}, \bm{I}_{N/2})$, then $\mathcal{F}^{-1}(\bm{z}) \sim \mathcal{N}(\bm{0}, \bm{I}_N)$.
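
To make the statement concrete, here is a minimal sketch of one map with the same two properties. It is not the paper's $\mathcal{F}$, whose definition is given in the full text; the pairing map `real_to_complex` below is a hypothetical example chosen for simplicity. It assumes the usual convention that $\bm{z} \sim \mathcal{CN}(\bm{0}, \bm{I})$ has independent real and imaginary parts with variance $1/2$ each.

```python
import math
import random


def real_to_complex(x):
    """Hypothetical pairing map R^N -> C^(N/2): z_k = (x_{2k} + i*x_{2k+1}) / sqrt(2)."""
    if len(x) % 2 != 0:
        raise ValueError("N must be even")
    s = math.sqrt(2.0)
    return [complex(x[2 * k], x[2 * k + 1]) / s for k in range(len(x) // 2)]


def complex_to_real(z):
    """Inverse of real_to_complex: interleave sqrt(2)*Re(z_k) and sqrt(2)*Im(z_k)."""
    s = math.sqrt(2.0)
    x = []
    for zk in z:
        x.append(s * zk.real)
        x.append(s * zk.imag)
    return x


def sample_standard_complex_normal(n, rng):
    """Draw z ~ CN(0, I_n): real and imaginary parts are independent N(0, 1/2)."""
    sigma = math.sqrt(0.5)
    return [complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n)]


rng = random.Random(0)
N = 8

# Bijection check: mapping to C^(N/2) and back recovers the input.
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
round_trip = complex_to_real(real_to_complex(x))
max_round_trip_error = max(abs(a - b) for a, b in zip(x, round_trip))

# Distribution check: pulling CN(0, I) samples back should give unit-variance reals.
num_samples = 20000
flat = []
for _ in range(num_samples):
    flat.extend(complex_to_real(sample_standard_complex_normal(N // 2, rng)))
empirical_mean = sum(flat) / len(flat)
empirical_var = sum(v * v for v in flat) / len(flat) - empirical_mean ** 2
```

The factor $\sqrt{2}$ is what turns the per-component variance of $1/2$ in the complex domain into unit variance in the real domain. Any map with these two properties lets a constraint be stated on $N/2$ complex coordinates and transferred back to the real latent without changing its distribution.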

Figures (7)

  • Figure 1: Effectiveness of Projection onto $\mathcal{G}_{\mathbb{R}}$. Starting from an initial latent encoding the letter ‘A’, we compare two regularization methods and our projection. Our method preserves high cosine similarity to the initial latent while reducing the spatial correlations. Unlike hwang2025moment, which requires slow gradient-based iterative projection, our method guarantees optimality with a single operation. The images are sampled from FLUX with the prompt "Piano".
  • Figure 2: Quantitative results with FLUX model. Each column corresponds to the same given reward (x-axis), and different held-out rewards (y-axis). Each point denotes the score after 200 iterations, with higher positions and more rightward placement indicating better trade-offs. For baselines, multiple points are plotted across learning rates and regularization schemes. Our method consistently achieves the best trade-off across all reward–held-out reward pairs.
  • Figure 3: Qualitative results with FLUX model. Columns denote optimization method; rows correspond to the given reward, with the prompt shown above each row. Our constrained optimization preserves realism and prompt fidelity while attaining higher target scores and strong held-out quality.
  • Figure 4: Distributions of $\ell_1$ and $\ell_2$ norms of $\bm{y}_{16} \sim \mathcal{CN}(\bm{0}, \bm{I}_{16})$. The radial plots visualize 100K samples, where radius indicates the norm and angle indicates $\arg (\bm{1}^\top \bm{y}_{16})$. In both cases, the norms are concentrated around their expected values.
  • Figure 5: Empirical distributions of spectral magnitudes for each optimization method. Each panel compares values obtained from white Gaussian noise with values from the optimized latent vector. For each frequency or block $p$, the corresponding magnitude or norm is plotted at its location on the horizontal axis. The rightmost panel shows the empirical and theoretical probability density of $|y_i|$.
  • ...and 2 more figures
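
The concentration described in the Figure 4 caption can be checked numerically. The sketch below is a rough Monte Carlo check and not the authors' code. It assumes $\bm{y}_{16} \sim \mathcal{CN}(\bm{0}, \bm{I}_{16})$ with $\mathbb{E}|y_i|^2 = 1$, so that each $|y_i|$ is Rayleigh distributed with mean $\sqrt{\pi}/2$, giving $\mathbb{E}\|\bm{y}_{16}\|_1 = 8\sqrt{\pi}$ and $\mathbb{E}\|\bm{y}_{16}\|_2^2 = 16$. It draws fewer samples than the 100K used in the figure, and all function names are hypothetical.

```python
import math
import random


def sample_complex_normal(n, rng):
    """Draw y ~ CN(0, I_n): real and imaginary parts are independent N(0, 1/2)."""
    sigma = math.sqrt(0.5)
    return [complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n)]


def mean_and_std(values):
    """Population mean and standard deviation of a list of floats."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)


rng = random.Random(0)
dim = 16
num_samples = 20000

l1_norms, l2_norms = [], []
for _ in range(num_samples):
    y = sample_complex_normal(dim, rng)
    l1_norms.append(sum(abs(v) for v in y))
    l2_norms.append(math.sqrt(sum(abs(v) ** 2 for v in y)))

l1_mean, l1_std = mean_and_std(l1_norms)
l2_mean, l2_std = mean_and_std(l2_norms)
l2_sq_mean = sum(v * v for v in l2_norms) / num_samples

# Closed-form expectations under the stated convention.
expected_l1 = dim * math.sqrt(math.pi) / 2.0
expected_l2_sq = float(dim)

print(f"l1: mean={l1_mean:.3f} (expected {expected_l1:.3f}), std/mean={l1_std / l1_mean:.3f}")
print(f"l2: mean={l2_mean:.3f}, mean of squares={l2_sq_mean:.3f} (expected {expected_l2_sq:.1f}), "
      f"std/mean={l2_std / l2_mean:.3f}")
```

A small ratio of standard deviation to mean is what "concentrated" means here, and it is why fixing these norms as hard constraints removes little of the probability mass of genuine white Gaussian noise.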

Theorems & Definitions (10)

  • Proposition 4.1
  • proof
  • Proposition 4.2
  • proof
  • Lemma 1.1
  • proof
  • Proposition \ref{thm:F_ind}
  • proof
  • Proposition \ref{thm:F_ind2}
  • proof