
Self-Corrected Image Generation with Explainable Latent Rewards

Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He

Abstract

Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
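The key mechanism described above, turning a non-differentiable image-level evaluation into a differentiable latent-level training signal, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch rendering of the reward projector $R_\phi$; the names (RewardProjector, projector_loss, mllm-produced image_rewards) are illustrative assumptions, not the released xLARD API.

```python
# Minimal sketch (hypothetical, PyTorch): distilling a non-differentiable
# image-level reward (e.g., alignment scores from an MLLM evaluator) into a
# differentiable latent-level surrogate. Names are illustrative only.
import torch
import torch.nn as nn

class RewardProjector(nn.Module):
    """Maps a latent edit to predicted per-dimension reward scores (R_phi)."""
    def __init__(self, latent_dim: int, num_rewards: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.GELU(),
            nn.Linear(256, num_rewards),
        )

    def forward(self, z_edit: torch.Tensor) -> torch.Tensor:
        # Differentiable reward estimates, one score per reward dimension
        # (e.g., counting, spatial relations, color composition).
        return self.net(z_edit)

def projector_loss(projector: RewardProjector,
                   z_edit: torch.Tensor,
                   image_rewards: torch.Tensor) -> torch.Tensor:
    """Regress projected rewards onto image-level scores. The image-level
    evaluator itself is non-differentiable; only the projector carries
    gradients back to the latent space."""
    pred = projector(z_edit)
    return nn.functional.mse_loss(pred, image_rewards)
```

Once the projector tracks the image-level evaluator well enough, its output can be ascended directly with respect to the latent, which is what makes continuous latent-level guidance possible.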

Figures (10)

  • Figure 1: We propose xLARD, a self-correcting generation framework guided by explainable latent rewards. Left: Compared to the baseline, xLARD adheres more faithfully to prompts involving counting, spatial positioning, and color composition. Each example pairs the baseline output with our result for the same prompt. Right: Performance gain versus training-data size on the GenEval and DPG-Bench benchmarks, showing that xLARD achieves higher gains with fewer samples.
  • Figure 2: Overview of the xLARD framework. Given a prompt $p$, the frozen backbone encodes it into a latent representation $z_0$. The residual corrector $\Delta_\theta$ refines $z_0$ under multi-dimensional reward guidance, producing a corrected latent $z_c$ that is decoded into an image $\hat{x}$. Image-level rewards are projected back to the latent space via a learnable reward projector $R_\phi$, enabling end-to-end, interpretable correction learning. During inference, the residual corrector functions as a lightweight latent modifier with no additional sampling or retraining (see the sketch after this figure list).
  • Figure 3: Qualitative comparison of image generation/editing performance between HermesFlow and our proposed approach.
  • Figure 4: Token-level contributions for misalignment detection. Positive bars indicate tokens aligned with the image, negative bars indicate tokens driving residual corrections.
  • Figure 5: Visualization of latent residual corrections. The high-intensity regions in the correction map indicate where the residual module most strongly adjusts the latent features. The prompt used for this example is "A skateboarder performing a jump mid-air above a concrete ramp, another person watching from the left."
  • ...and 5 more figures
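The pipeline in Figure 2 can be summarized in a few lines. Below is a minimal, hypothetical PyTorch sketch of the correction loop, assuming a frozen backbone and the reward projector sketched earlier; `backbone`, `ResidualCorrector`, and `correction_step` are illustrative stand-ins, not the released implementation.

```python
# Minimal sketch (hypothetical, PyTorch) of the correction loop in Figure 2.
# The backbone and decoder are stand-ins; only the corrector and the reward
# projector would be trained in this setup.
import torch
import torch.nn as nn

class ResidualCorrector(nn.Module):
    """Lightweight corrector Delta_theta: predicts an additive latent residual."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.delta = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, z0: torch.Tensor) -> torch.Tensor:
        # Corrected latent z_c = z_0 + Delta(z_0).
        return z0 + self.delta(z0)

@torch.no_grad()
def encode_prompt(backbone, prompt: str) -> torch.Tensor:
    """Frozen backbone: prompt -> initial latent z_0 (no gradients)."""
    return backbone(prompt)

def correction_step(corrector, projector, z0, optimizer):
    """One training step: increase the projected rewards of the corrected
    latent by ascending the differentiable surrogate."""
    zc = corrector(z0)
    rewards = projector(zc - z0)   # interpretable per-dimension estimates
    loss = -rewards.mean()         # maximize projected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return zc
```

Because the backbone stays frozen and only the residual corrector and reward projector carry gradients, the base model's generative prior is left intact, consistent with the abstract's claim that xLARD improves alignment while maintaining generative priors.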