Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, Aditya Grover

TL;DR

Reflect-DiT tackles the inefficiency of training-time scaling in text-to-image diffusion by enabling iterative, in-context refinement using past generations and natural-language feedback from a vision-language model. The system couples a VLM feedback judge with a Diffusion Transformer that consumes a fixed-length in-context history through a Context Transformer to generate improved outputs. On GenEval, Reflect-DiT achieves a state-of-the-art 0.81 with only 20 samples per prompt, outperforming larger models that rely on thousands of samples, and demonstrating practical, scalable inference-time gains. The work highlights the promise of reflection-based feedback for diffusion models while noting limitations of the VLM judge and scene-detail sensitivity, pointing to future work in auditing feedback quality and expanding robustness.

Abstract

The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using a significantly larger model (SANA-1.5-4.8B) with 2048 samples under the best-of-N approach.
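The contrast with best-of-N sampling can be made concrete with a minimal sketch of a reflection loop. This is an illustration under stated assumptions, not the authors' implementation: `reflect_loop`, `toy_generate`, and `toy_judge` are hypothetical names, an "image" is reduced to an integer object count, and the history length of 3 is an assumed value for the fixed-length in-context history mentioned in the TL;DR.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple

Image = int  # toy stand-in: an "image" is just the number of objects it shows
History = Deque[Tuple[Image, str]]  # (past image, textual feedback) pairs


def reflect_loop(
    prompt: str,
    generate: Callable[[str, History], Image],
    judge: Callable[[str, Image], Optional[str]],
    max_samples: int = 20,
    history_len: int = 3,  # assumed value, not taken from the paper
) -> Tuple[Image, int]:
    """Generate until the judge has no complaint or the sample budget is spent.

    Returns the last image and the number of samples used.
    """
    history: History = deque(maxlen=history_len)  # oldest entries drop out
    image = None
    for used in range(1, max_samples + 1):
        image = generate(prompt, history)  # conditioned on past attempts
        feedback = judge(prompt, image)    # None means "no issues found"
        if feedback is None:
            return image, used
        history.append((image, feedback))
    return image, max_samples


TARGET = 5  # hypothetical ground truth for the toy prompt below


def toy_judge(prompt: str, image: Image) -> Optional[str]:
    # Hypothetical stand-in for the VLM judge.
    if image == TARGET:
        return None
    return "too few objects" if image < TARGET else "too many objects"


def toy_generate(prompt: str, history: History) -> Image:
    # Hypothetical stand-in for the DiT: moves toward what the feedback asks.
    if not history:
        return 2  # arbitrary first attempt
    last_image, last_feedback = history[-1]
    return last_image + 1 if "too few" in last_feedback else last_image - 1


result = reflect_loop("five monarch butterflies", toy_generate, toy_judge)
print(result)  # (5, 4)
```

In this toy setting each new sample is steered by the most recent feedback, so the loop stops after four samples; a best-of-N baseline would instead draw every sample independently and select among them afterwards.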

Paper Structure

This paper contains 44 sections, 1 equation, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Reflect-DiT iteratively refines image generation by using a vision-language model (VLM) to critique generations and a Diffusion Transformer (DiT) to self-improve using past generations and feedback. Specifically, at each generation step N, feedback from previous iterations (N-3, N-2, N-1, …) is incorporated to progressively improve future generations. Unlike traditional best-of-N sampling, Reflect-DiT actively corrects errors in object count, position, and attributes, enabling more precise generations with fewer samples.
  • Figure 2: Architecture of Reflect-DiT. Given a prompt, past images and feedback, we first encode the images into a set of vision embeddings $[V_1,V_2,\dots]$ using a vision encoder, and encode text feedback to a set of text embeddings $[E_1,E_2,\dots]$. We then concatenate these embeddings into a single sequence $M$, and pass it through the Context Transformer to obtain $M'$. The extra context $M'$ is concatenated directly after the standard prompt embeddings and passed into the cross-attention layers of the Diffusion Transformer (DiT).
  • Figure 3: Side-by-side qualitative comparison of Reflect-DiT and best-of-N sampling. Reflect-DiT leverages feedback to iteratively refine image generations, resulting in more accurate and visually coherent outputs. In the first example, Reflect-DiT progressively adjusts object positions to better satisfy the prompt "a cup left of an umbrella," achieving significantly better image-text alignment than best-of-N sampling. The second example demonstrates how Reflect-DiT corrects multiple counting constraints ("five monarch butterflies" and "a single dandelion") over successive iterations, gradually converging to the correct solution. Lastly, in the rightmost example, Reflect-DiT uses in-context feedback to refine object shapes, producing a more precise and intentional design compared to best-of-N.
  • Figure 4: Comparison of Reflect-DiT with other finetuning methods. We find that Reflect-DiT is able to consistently outperform finetuning methods, like supervised finetuning (SFT) and Diffusion-DPO (DPO). Using only 4 samples, Reflect-DiT can outperform related finetuning baselines using best-of-20 sampling.
  • Figure 5: Human evaluation win-rate (%) on PartiPrompts dataset. We perform a user study to evaluate the effectiveness of Reflect-DiT in broadly improving text-to-image generation. Results show that human evaluators consistently prefer generations from Reflect-DiT over best-of-N sampling.
  • ...and 4 more figures
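
The data flow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only a shape-level illustration under assumptions: `build_cross_attention_context` and `identity_context_transformer` are hypothetical names, the Context Transformer is replaced by an identity placeholder, and the interleaved ordering $V_1, E_1, V_2, E_2, \dots$ inside $M$ is an assumption, since the caption does not state the order.

```python
from typing import Callable, List, Sequence

Token = List[float]   # one embedding vector
Tokens = List[Token]  # a sequence of embedding vectors


def build_cross_attention_context(
    prompt_emb: Tokens,
    vision_embs: Sequence[Tokens],    # [V_1, V_2, ...], one entry per past image
    feedback_embs: Sequence[Tokens],  # [E_1, E_2, ...], one entry per feedback text
    context_transformer: Callable[[Tokens], Tokens],
) -> Tokens:
    """Return the sequence passed to the DiT cross-attention layers."""
    if len(vision_embs) != len(feedback_embs):
        raise ValueError("each past image needs exactly one feedback entry")
    # Build M. Assumption: pairs are interleaved as V_1, E_1, V_2, E_2, ...
    m: Tokens = []
    for v, e in zip(vision_embs, feedback_embs):
        m.extend(v)
        m.extend(e)
    m_prime = context_transformer(m)  # M -> M'
    # M' is concatenated directly after the standard prompt embeddings.
    return list(prompt_emb) + list(m_prime)


def identity_context_transformer(m: Tokens) -> Tokens:
    # Placeholder only: a real Context Transformer would mix information
    # across tokens; this one just copies them.
    return [list(token) for token in m]


dim = 4  # toy embedding width
prompt_emb = [[0.0] * dim for _ in range(6)]                         # 6 prompt tokens
vision_embs = [[[1.0] * dim for _ in range(3)] for _ in range(2)]    # 2 images, 3 tokens each
feedback_embs = [[[2.0] * dim for _ in range(5)] for _ in range(2)]  # 2 feedbacks, 5 tokens each

context = build_cross_attention_context(
    prompt_emb, vision_embs, feedback_embs, identity_context_transformer
)
print(len(context))  # 22
```

With these toy sizes the cross-attention sequence grows from 6 prompt tokens to 6 + 2 × (3 + 5) = 22 tokens, which shows why the amount of in-context history directly affects the cross-attention cost.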