MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

Yapeng Mi, Hengli Li, Yanpeng Zhao, Chenxi Li, Huimin Wu, Xiaojian Ma, Song-Chun Zhu, Ying Nian Wu, Qing Li

TL;DR

MILR introduces a test-time latent reasoning strategy that operates in a unified multimodal latent space to improve text-conditioned image generation. By optimizing intermediate latent representations for both text and image tokens using a policy-gradient objective guided by a reward model, MILR achieves state-of-the-art results across GenEval, T2I-CompBench, and WISE without updating model parameters. The key finding is that joint cross-modal reasoning in the latent space yields substantial gains and enables temporal and cultural reasoning, as demonstrated by qualitative analyses. The work highlights practical benefits of test-time optimization and discusses reward-model design and potential limitations for future research.

Abstract

Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR's non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.
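The search described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a small frozen linear output head in place of the MUG model, a hypothetical `reward_fn` that scores sampled tokens against a fixed target in place of the image quality critic, and a plain REINFORCE update with a mean-reward baseline. All names (`optimize_latents`, `token_probs`, `head`, `target`) are made up for illustration. The point it shows is that only the latents $\mathbf{z}=[\mathbf{z}^{(t)},\mathbf{z}^{(v)}]$ change, while the model weights stay fixed.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_probs(z, head):
    """Frozen head: map each latent vector to a distribution over tokens."""
    vocab = len(head[0])
    return [softmax([sum(zi[d] * head[d][v] for d in range(len(zi)))
                     for v in range(vocab)]) for zi in z]

def optimize_latents(z0, head, reward_fn, steps=200, lr=0.5, samples=16, seed=0):
    """REINFORCE-style ascent on the latents only; `head` is never modified."""
    rng = random.Random(seed)
    z = [row[:] for row in z0]  # work on a copy of the initial latents
    vocab, dim = len(head[0]), len(head)
    for _ in range(steps):
        probs = token_probs(z, head)
        # Expected head entry under the current policy, per latent position.
        mean_head = [[sum(p[v] * head[d][v] for v in range(vocab))
                      for d in range(dim)] for p in probs]
        draws = [[rng.choices(range(vocab), weights=p)[0] for p in probs]
                 for _ in range(samples)]
        rewards = [reward_fn(tokens) for tokens in draws]
        baseline = sum(rewards) / samples  # mean reward as a simple baseline
        for tokens, r in zip(draws, rewards):
            for i, tok in enumerate(tokens):
                for d in range(dim):
                    # d log p(tok | z_i) / d z_i[d] = head[d][tok] - E_p[head[d][.]]
                    score = head[d][tok] - mean_head[i][d]
                    z[i][d] += lr * (r - baseline) * score / samples
    return z

# Toy setup (all values are assumptions): 3 "text" and 3 "image" latents.
setup_rng = random.Random(1)
n_text, n_image, dim, vocab = 3, 3, 8, 5
head = [[setup_rng.gauss(0, 1) for _ in range(vocab)] for _ in range(dim)]
z_text = [[setup_rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_text)]
z_image = [[setup_rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_image)]
z0 = z_text + z_image  # unified latent sequence z = [z_text, z_image]
target = [setup_rng.randrange(vocab) for _ in range(n_text + n_image)]

def reward_fn(tokens):
    """Stand-in critic: fraction of sampled tokens that match the target."""
    return sum(t == g for t, g in zip(tokens, target)) / len(target)

def expected_reward(z):
    """Exact expected reward of the policy induced by latents z."""
    probs = token_probs(z, head)
    return sum(p[g] for p, g in zip(probs, target)) / len(target)

z_opt = optimize_latents(z0, head, reward_fn)
print("expected reward before:", expected_reward(z0))
print("expected reward after: ", expected_reward(z_opt))
```

In this sketch the text and image latents share one update rule and one reward signal, which is a simplified picture of joint cross-modal search. A real critic would score the decoded image against the prompt rather than compare tokens to a known target.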

Paper Structure

This paper contains 34 sections, 5 equations, 13 figures, 9 tables, and 5 algorithms.

Figures (13)

  • Figure 1: Latent reasoning of MILR. The black solid line denotes extracting the output vector representations $\mathbf{z}^{k}$ of the text tokens $\mathbf{z}^{(t)}$ and image tokens $\mathbf{z}^{(v)}$ to be optimized, and the black dashed line denotes decoding from the optimized latent vectors $\mathbf{z}^{k+1}$, where $\mathbf{z}=[\mathbf{z}^{(t)},\mathbf{z}^{(v)}]$.
  • Figure 2: Overview of MILR. MILR performs test-time latent reasoning in a unified latent space: it uses policy gradients to iteratively refine the text and image latents $\mathbf{z}^{(t)},\mathbf{z}^{(v)}$, guided by a reward model. The reward model scores each generated image conditioned on the initial prompt.
  • Figure 3: Qualitative studies on WISE. Reasoning cues are highlighted in red.
  • Figure 4: Performance across three benchmarks for varying optimization steps.
  • Figure 5: GenEval scores with varying optimization ratios of text and image tokens.
  • ...and 8 more figures