Toward Early Quality Assessment of Text-to-Image Diffusion Models

Huanlei Guo; Hongxin Wei; Bingyi Jing

TL;DR

This work introduces Probe-Select, a plug-in module that evaluates image quality efficiently during generation. It builds on the observation that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure that strongly correlates with final image fidelity.

Abstract

Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practice, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. This pipeline is highly resource-intensive, since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post hoc: they can only be applied once generation has finished. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure (object layout and spatial arrangement) that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.
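
To make the selective-continuation idea concrete, the following is a minimal sketch of the bookkeeping only, not the authors' implementation. It assumes that early quality scores are already available for each seed (in Probe-Select they would come from the probe), and that cost is counted purely in denoising steps with the probe's own overhead ignored. The function names `select_top_k` and `relative_cost` are hypothetical.

```python
from typing import List, Sequence


def select_top_k(early_scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest early scores, best first."""
    if not 0 < k <= len(early_scores):
        raise ValueError("k must be between 1 and the number of candidates")
    order = sorted(range(len(early_scores)),
                   key=lambda i: early_scores[i],
                   reverse=True)
    return order[:k]


def relative_cost(n_candidates: int, n_keep: int, probe_fraction: float) -> float:
    """Denoising steps spent with early selection, relative to full sampling.

    Every candidate runs the first `probe_fraction` of the trajectory; only the
    kept candidates run the remainder. Probe overhead is ignored (assumption).
    """
    if not 0.0 <= probe_fraction <= 1.0:
        raise ValueError("probe_fraction must lie in [0, 1]")
    if not 0 < n_keep <= n_candidates:
        raise ValueError("n_keep must be between 1 and n_candidates")
    keep_ratio = n_keep / n_candidates
    return probe_fraction + keep_ratio * (1.0 - probe_fraction)


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.4, 0.7]       # hypothetical early scores for 4 seeds
    print(select_top_k(scores, 2))       # seeds that continue to full sampling
    print(relative_cost(8, 2, 0.2))      # fraction of the full-sampling budget
```

Under these simplifying assumptions, probing at 20% of the trajectory and keeping one candidate in four costs 0.2 + 0.25 × 0.8 = 0.4 of the full budget, a 60% saving; keeping a smaller share saves more. The paper's reported savings depend on its own choices of candidates and kept seeds, which are not given in this summary.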

Paper Structure (27 sections, 8 equations, 11 figures, 10 tables)

This paper contains 27 sections, 8 equations, 11 figures, 10 tables.

Introduction
Preliminaries: Generation-then-select
Proposed Method
Problem Formulation: Assessing Quality from Partial Generative States
Model Architecture: Early Structural Probes
Training Objectives: Aligning with Evaluators and Prompts
Applications: Selective Sampling and Beyond
Related Work
Efficiency in Diffusion Models
Text-to-Image Evaluation
Diffusion Features for Downstream Vision Tasks
Experimental Results
Experimental Setup
Early Structural Evidence in Denoiser Features
Quantitative Analysis: Predicting Final Quality from Partial States
...and 12 more sections

Figures (11)

Figure 1: Top: Snapshots of the reverse process in the Stable Diffusion sampler, from noisy latent to clean image. Middle: Denoiser architecture. Bottom: Visualization of hidden states, showing that coarse layout and object contours emerge early, change slowly, and correlate with the final image.
Figure 2: Overview of Probe-Select training. The model receives the intermediate denoiser activations and the timestep $t$ and predicts the final quality score. An additional text-aligned InfoNCE loss is used to learn meaningful representations.
Figure 3: PCA visualization of the Stable Diffusion 2 denoising network across time.
Figure 4: Boxplots of Evaluation Metrics of Samples Generated by SD3-L. All metric scores are normalized by dividing by their respective maximum values for visualization consistency.
Figure 5: Relationship between the number of candidates (N) and the number of selected seeds (K).
...and 6 more figures
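
The Figure 2 caption mentions a text-aligned InfoNCE loss. As a generic illustration of that family of losses, here is a small pure-Python sketch over a precomputed similarity matrix. It is not the paper's formulation: the one-directional form, the temperature value, and the name `info_nce` are assumptions made for this example.

```python
import math
from typing import Sequence


def info_nce(similarity: Sequence[Sequence[float]], temperature: float = 0.1) -> float:
    """Mean InfoNCE loss over a square similarity matrix.

    similarity[i][j] is the similarity between probe feature i and prompt
    embedding j; the diagonal holds the matched (positive) pairs.
    """
    n = len(similarity)
    if n == 0 or any(len(row) != n for row in similarity):
        raise ValueError("similarity must be a non-empty square matrix")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    total = 0.0
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max before exponentiating, for stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / n


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]   # matched pairs are most similar
    uniform = [[0.0, 0.0], [0.0, 0.0]]   # no preference for matched pairs
    print(info_nce(aligned), info_nce(uniform))
```

The loss is small when each feature is most similar to its own prompt and equals log(batch size) when all similarities are identical, which is why it encourages prompt-aligned representations.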
