From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, Hongsheng Li

TL;DR

ReflectionFlow introduces a self-refinement paradigm for text-to-image diffusion models that spends inference-time computation along three axes: noise initialization, prompt guidance, and explicit reflections. The large-scale GenRef dataset (1 million triplets) and the GenRef-CoT annotations underpin efficient reflection tuning of diffusion transformers such as FLUX.1-dev, enabling a corrector to iteratively refine outputs using multimodal feedback from reward models and LLMs. Experiments on GenEval demonstrate substantial quality gains as the inference-time budget is scaled, particularly on challenging prompts, highlighting the potential of progressive reflection to close the gap between fixed-budget generation and high-fidelity image synthesis.

Abstract

Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and, most notably, (3) reflection-level scaling, which explicitly provides actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on a state-of-the-art diffusion transformer, FLUX.1-dev, by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks.
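The interplay of the three scaling axes can be illustrated with a minimal sketch. It is not the paper's algorithm or API: every function name, the parameters `num_seeds` and `num_rounds`, and the string-based toy "images" are assumptions made purely for illustration. The loop samples several noise seeds (noise-level), asks a reflector for feedback and rewrites the prompt (prompt-level), and passes the previous image plus the reflection to a corrector (reflection-level), keeping whichever candidate a verifier scores highest.

```python
from dataclasses import dataclass

Image = str  # hypothetical stand-in; a real system would use image tensors


@dataclass
class Candidate:
    image: Image
    score: float


def reflection_search(prompt, generate, verify, reflect, refine_prompt, correct,
                      num_seeds=4, num_rounds=3):
    """Illustrative search loop; all callables are hypothetical placeholders."""
    def best_of(images):
        # Always score against the ORIGINAL prompt so rounds stay comparable.
        scored = [Candidate(img, verify(prompt, img)) for img in images]
        return max(scored, key=lambda c: c.score)

    # Noise-level scaling: try several initial noise seeds, keep the best.
    best = best_of([generate(prompt, seed) for seed in range(num_seeds)])

    for _ in range(num_rounds):
        # Reflection-level scaling: describe what is wrong with the best image.
        reflection = reflect(prompt, best.image)
        if not reflection:
            break  # nothing left to fix according to the reflector
        # Prompt-level scaling: rewrite the prompt using the reflection.
        refined = refine_prompt(prompt, reflection)
        challenger = best_of([correct(refined, best.image, reflection, seed)
                              for seed in range(num_seeds)])
        if challenger.score > best.score:
            best = challenger
    return best


# Toy stand-ins: an "image" is just the list of prompt words it depicts.
def toy_generate(prompt, seed):
    return " ".join(w for i, w in enumerate(prompt.split()) if i % (seed + 2) == 0)


def toy_verify(prompt, image):
    wanted = set(prompt.split())
    return len(wanted & set(image.split())) / len(wanted)


def toy_reflect(prompt, image):
    present = set(image.split())
    missing = [w for w in prompt.split() if w not in present]
    return f"add {missing[0]}" if missing else ""


def toy_refine_prompt(prompt, reflection):
    return f"{prompt} ({reflection})"


def toy_correct(refined_prompt, image, reflection, seed):
    # The toy corrector ignores the refined prompt and seed; it only applies the fix.
    return f"{image} {reflection.split(' ', 1)[1]}"


result = reflection_search("red cube on blue sphere", toy_generate, toy_verify,
                           toy_reflect, toy_refine_prompt, toy_correct)
```

In this toy setting the best initial sample covers three of the five prompt words, and two reflection rounds add the missing ones. In a real pipeline the verifier would be a reward model and the reflector a multimodal LLM, as the abstract describes.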

Paper Structure

This paper contains 27 sections, 4 equations, 19 figures, 2 tables, 1 algorithm.

Figures (19)

  • Figure 1: Construction pipelines and statistics of our GenRef dataset. We collect our reflection triplets (flawed images, enhanced images, textual reflections) from four distinct data sources: rule-based data, reward-based data, long-short prompt data, and editing data.
  • Figure 2: Comparison of textual reflections generated by the original Qwen2.5-VL-7B, our fine-tuned image reflector, and GPT-4o.
  • Figure 3: Illustrations of three different inference-time scaling dimensions for text-to-image diffusion models.
  • Figure 4: Left: The choice of verifier significantly impacts the effectiveness of inference-time scaling methods. Middle: By efficiently scaling the inference-time budget, ReflectionFlow achieves substantial performance improvements, requiring 10 times fewer samples compared to naive noise-level scaling. Right: ReflectionFlow demonstrates notably greater performance gains on challenging samples.
  • Figure 5: Visualization of complex reasoning. Starting from initially incorrect generations (the first image), ReflectionFlow iteratively reflects on and corrects errors, progressively producing images that accurately align with the provided prompts and reflection instructions.
  • ...and 14 more figures
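
The reflection triplets named in the Figure 1 caption could be represented roughly as follows. This is a hedged sketch only: the class names, field names, and the choice of file paths for images are assumptions, not the dataset's actual schema. The four source labels are taken from the caption.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Source(Enum):
    # The four data sources listed in the Figure 1 caption.
    RULE_BASED = "rule-based"
    REWARD_BASED = "reward-based"
    LONG_SHORT_PROMPT = "long-short prompt"
    EDITING = "editing"


@dataclass(frozen=True)
class ReflectionTriplet:
    # Field names are hypothetical; images are referenced by path here.
    flawed_image: str
    enhanced_image: str
    reflection: str
    source: Source


def count_by_source(triplets: Iterable[ReflectionTriplet]) -> Counter:
    """Tally triplets per data source, e.g. to reproduce dataset statistics."""
    return Counter(t.source for t in triplets)


# Made-up example records, for illustration only.
examples = [
    ReflectionTriplet("bad_0.png", "good_0.png", "Add the missing second cat.", Source.RULE_BASED),
    ReflectionTriplet("bad_1.png", "good_1.png", "Make the sky darker.", Source.EDITING),
    ReflectionTriplet("bad_2.png", "good_2.png", "Fix the object count.", Source.RULE_BASED),
]
counts = count_by_source(examples)
```

Keeping the source tag on each record makes it straightforward to filter or rebalance the four sources during training.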