Toward Early Quality Assessment of Text-to-Image Diffusion Models

Huanlei Guo, Hongxin Wei, Bingyi Jing

TL;DR

This work introduces Probe-Select, a plug-in module that evaluates image quality efficiently during the generation process. It observes that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure that strongly correlates with final image fidelity.

Abstract

Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive, since each candidate requires tens to hundreds of denoising steps and evaluation metrics such as CLIPScore and ImageReward can only be computed post hoc, on finished images. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure (object layout and spatial arrangement) that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.
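
The selective-continuation idea in the abstract can be illustrated with a small cost model. The sketch below is not the authors' implementation: `select_early`, `toy_score`, and the settings (16 candidates, 3 kept, 50 steps) are hypothetical, the probe is replaced by a deterministic stand-in, and the probe's own overhead is ignored. Under those assumptions, the fraction of denoiser steps saved is (1 - f)(1 - K/N), where f is the fraction of the trajectory run before scoring.

```python
import random


def select_early(num_candidates, keep, total_steps, probe_fraction, score_fn, seed=0):
    """Run every candidate to the probe step, then continue only the top `keep`."""
    rng = random.Random(seed)
    probe_step = max(1, round(probe_fraction * total_steps))
    seeds = [rng.randrange(2**31) for _ in range(num_candidates)]

    # Hypothetical probe call: score each seed from its state at the probe step.
    ranked = sorted(seeds, key=lambda s: score_fn(s, probe_step), reverse=True)
    kept = ranked[:keep]

    # Denoiser steps spent: all seeds up to the probe, survivors to the end.
    steps_used = num_candidates * probe_step + keep * (total_steps - probe_step)
    steps_full = num_candidates * total_steps
    return kept, 1.0 - steps_used / steps_full


def toy_score(seed, step):
    # Stand-in for a learned quality probe (assumption, not the paper's model).
    return random.Random(seed).random()


kept, saving = select_early(
    num_candidates=16, keep=3, total_steps=50, probe_fraction=0.2, score_fn=toy_score
)
print(len(kept), round(saving, 2))
```

With these illustrative numbers, scoring at 20% of the trajectory and keeping 3 of 16 seeds uses 280 of 800 denoiser steps. In this simplified model, a saving above 60% at a 20% probe point requires keeping fewer than a quarter of the candidates.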

Paper Structure (27 sections, 8 equations, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Top: Snapshots of the reverse process of the Stable Diffusion sampler, from noisy latent to clean image. Middle: Denoiser architecture. Bottom: Visualization of hidden states, showing that coarse layout and object contours emerge early, change slowly, and correlate with the final image.
  • Figure 2: Overview of Probe-Select training. The model receives the intermediate denoiser activations and the timestep $t$ and predicts the final quality score. An additional text-aligned InfoNCE loss is employed for meaningful representation learning.
  • Figure 3: PCA visualization of the Stable Diffusion 2 denoising network over time.
  • Figure 4: Boxplots of Evaluation Metrics of Samples Generated by SD3-L. All metric scores are normalized by dividing by their respective maximum values for visualization consistency.
  • Figure 5: Relationship between the number of candidates (N) and the number of selected seeds (K).
  • ...and 6 more figures
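
The Figure 2 caption mentions a text-aligned InfoNCE loss. As a rough illustration of what such a loss computes in general, the following sketch implements a generic one-directional InfoNCE over a batch of paired embeddings. The function name `info_nce`, the temperature of 0.07, the cosine-similarity choice, and the toy 2-D embeddings are all assumptions; the paper's exact formulation is not given in this excerpt.

```python
import math


def _normalize(vec):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(probe_embs, text_embs, temperature=0.07):
    """Mean cross-entropy where probe embedding i should match text embedding i."""
    probes = [_normalize(v) for v in probe_embs]
    texts = [_normalize(v) for v in text_embs]
    total = 0.0
    for i, p in enumerate(probes):
        logits = [sum(a * b for a, b in zip(p, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp over all texts in the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(probes)


# Toy check: matched pairs should give a lower loss than mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each probe embedding is most similar to its own prompt's text embedding and large when it is closer to another prompt's embedding.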