
Diffusion Probe: Generated Image Result Prediction Using CNN Probes

Benlei Cui, Bukun Huang, Zhizeng Ye, Xuemei Dong, Tuo Chen, Hui Xue, Dingkang Yang, Longtao Huang, Jingqun Tang, Haiwen Hong

TL;DR

Diffusion Probe is a lightweight predictor that maps statistical properties of early-stage cross-attention extracted from initial denoising steps to the final image's overall quality, offering a practical solution for improving T2I generation efficiency through early quality prediction.

Abstract

Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and Flow-GRPO training. We reveal a strong correlation between early diffusion cross-attention distributions and final image quality. Based on this finding, we introduce Diffusion Probe, a framework that leverages internal cross-attention maps as predictive signals. We design a lightweight predictor that maps statistical properties of early-stage cross-attention, extracted from the initial denoising steps, to the final image's overall quality. This enables accurate forecasting of image quality across diverse evaluation metrics long before full synthesis is complete. We validate Diffusion Probe across a wide range of settings. On multiple T2I models, across early denoising windows, resolutions, and quality metrics, it achieves strong correlation (PCC > 0.7) and high classification performance (AUC-ROC > 0.9). Its reliability translates into practical gains. By enabling early quality-aware decisions in workflows such as prompt optimization, seed selection, and accelerated RL training, the probe supports more targeted sampling and avoids computation on low-potential generations. This reduces computational overhead while improving final output quality. Diffusion Probe is model-agnostic, efficient, and broadly applicable, offering a practical solution for improving T2I generation efficiency through early quality prediction.
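The two headline numbers in the abstract, PCC and AUC-ROC, can be illustrated with a minimal sketch. The scores below are made-up toy values, and the threshold used to turn the ground-truth metric into good/bad labels is an assumption for illustration; the abstract does not say how the paper binarizes quality.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def auc_roc(scores, labels):
    """AUC-ROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): probe predictions made early in denoising,
# and the metric value measured on the fully generated image.
predicted = [0.2, 0.4, 0.5, 0.7, 0.9]
actual = [0.1, 0.5, 0.4, 0.8, 0.85]

# Assumed binarization: an image counts as "good" if its metric is >= 0.45.
threshold = 0.45
labels = [1 if a >= threshold else 0 for a in actual]

pcc = pearson(predicted, actual)
auc = auc_roc(predicted, labels)
print(f"PCC = {pcc:.3f}, AUC-ROC = {auc:.3f}")
```

PCC measures how well the predicted score tracks the final metric as a continuous quantity, while AUC-ROC measures how well it separates acceptable from unacceptable generations, which is the quantity that matters when the probe is used to discard candidates early.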

Paper Structure (26 sections, 4 equations, 10 figures, 7 tables)


Figures (10)

  • Figure 1: Illustration of early cross-attention dispersion. Here, we present the prompt, the corresponding four cross-attention activation maps in the early denoising stage, and the final generated image. Compared to other tokens, the cross-attention activation map of the "bird" token shows significant sparsity in its spatial distribution.
  • Figure 2: Overview of the Diffusion Probe framework. Our framework takes as input the early-stage cross-attention feature maps (derived from the CrossAttn module at a probed timestep $t$) and the TimeStep Embedding. A lightweight network processes these inputs and outputs a quality score prediction for the final generated image ($x_0$). The network is trained so that this predicted score aligns with a specified ground-truth Metric (e.g., aesthetic, semantic coherence) evaluated on the fully synthesized image. The Diffusion Probe then serves as a versatile tool for downstream applications such as Prompt Optimization, Seed Selection, and Efficient-GRPO training.
  • Figure 3: Visualization of the Cross-Attention Map for the token "cat" within a FluxTransformerBlock. The image illustrates the spatial attention distribution for the text token "cat" generated from the prompt "A cat holding a sign that says hello world". Regions with intense red patterns indicate high attention scores, demonstrating where the model's focus is directed in response to the specified token.
  • Figure 4: Early-stage cross-attention maps reveal object rendering fidelity. (Top) For the prompt "Woman carrying a bunch of bananas on top of her hat", the model successfully renders all objects, resulting in sharp, focused attention maps for each token. (Bottom) In contrast, for "A child ... surrounded by ... building blocks in a playroom", the model fails to generate the building blocks, and the corresponding attention map becomes highly diffuse. This demonstrates that attention statistics serve as a reliable early indicator of object-level generation success or failure (maps extracted at step $t=5$, layer 19, as detailed in Figure 3).
  • Figure 5: Comparison of the PickScore during training steps for our method ("Ours") versus the baseline method ("Origin") applied to Flow-GRPO. The plot demonstrates that our approach enhances the stability and convergence speed of the training process, as evidenced by the smoother fluctuations and faster rise in the PickScore across training steps. This indicates a more consistent and efficient learning process when using our method.
  • ...and 5 more figures
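
The captions for Figures 1 and 4 contrast focused attention maps with diffuse ones. A minimal sketch of one way to quantify that contrast is the normalized spatial entropy of a token's attention map. The choice of entropy is an assumption made here for illustration; the captions do not state which statistics the probe actually consumes, and the 3×3 maps are toy values rather than real model outputs.

```python
import math

def normalized_entropy(attn_map):
    """Spatial entropy of a non-negative 2D map, scaled to [0, 1].

    0 means all attention mass sits on a single location (fully focused);
    1 means the mass is spread uniformly (fully diffuse).
    """
    flat = [v for row in attn_map for v in row]
    total = sum(flat)
    probs = [v / total for v in flat]  # normalize to a probability distribution
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(flat))

# Toy maps (hypothetical): one token attended at a single location,
# another attended equally everywhere.
focused = [[0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0]]
diffuse = [[1.0, 1.0, 1.0],
           [1.0, 1.0, 1.0],
           [1.0, 1.0, 1.0]]

print("focused:", normalized_entropy(focused))
print("diffuse:", normalized_entropy(diffuse))
```

Under this assumed statistic, a token whose early map scores near 1 would resemble the diffuse "building blocks" map in Figure 4, while a score near 0 would resemble the sharply localized maps of successfully rendered objects.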