TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning

Qingtao Yu, Changlin Song, Minghao Sun, Zhengyang Yu, Vinay Kumar Verma, Soumya Roy, Sumit Negi, Hongdong Li, Dylan Campbell

TL;DR

The paper addresses the inefficiency of exploring many noise seeds in test-time scaling for diffusion-based image generation. It introduces TTSnap, a pruning framework that uses noise-aware reward estimates on intermediate denoising results to discard low-potential candidates early. This is combined with noise-aware reward finetuning (NARF), which aligns rewards predicted at noisy intermediate steps with those of the final clean images via self-distillation and curriculum training. The method yields substantial compute savings and performance gains, including higher reward under a fixed compute budget and compatibility with post-training and local optimization techniques. It emphasizes the importance of generation diversity and global search for effective test-time scaling in diffusion models.

Abstract

A prominent approach to test-time scaling for text-to-image diffusion models formulates the problem as a search over multiple noise seeds, selecting the one that maximizes a certain image-reward function. The effectiveness of this strategy heavily depends on the number and diversity of noise seeds explored. However, verifying each candidate is computationally expensive, because each must be fully denoised before a reward can be computed. This severely limits the number of samples that can be explored under a fixed budget. We propose test-time scaling with noise-aware pruning (TTSnap), a framework that prunes low-quality candidates without fully denoising them. The key challenge is that reward models are learned in the clean image domain, and the ranking of rewards predicted for intermediate estimates is often inconsistent with that predicted for clean images. To overcome this, we train noise-aware reward models via self-distillation to align the reward for intermediate estimates with that of the final clean images. To stabilize learning across different noise levels, we adopt a curriculum training strategy that progressively shifts the data domain from clean images to noisy images. In addition, we introduce a new metric that measures reward alignment and computational budget utilization. Experiments demonstrate that our approach improves performance by over 16% compared with existing methods, enabling more efficient and effective test-time scaling. It also provides orthogonal gains when combined with post-training techniques and local test-time optimization. Code: https://github.com/TerrysLearning/TTSnap/.
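
The pruning idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`prune_then_select`, `denoising_steps_used`), the parameters `keep_ratio` and `steps_before_prune`, and the cost model (counting denoising steps only and ignoring the cost of the reward model) are all assumptions made for illustration, and the reward values are made up.

```python
import numpy as np

def prune_then_select(intermediate_rewards, final_reward_fn, keep_ratio):
    """Keep the top `keep_ratio` fraction of candidates by intermediate
    reward, then pick the survivor with the highest clean-image reward.
    All names here are hypothetical, not the paper's API."""
    scores = np.asarray(intermediate_rewards, dtype=float)
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    # Indices of the n_keep highest intermediate rewards.
    survivors = np.argsort(scores)[::-1][:n_keep]
    # Only survivors pay for full denoising and a clean-image reward.
    final = {int(i): float(final_reward_fn(int(i))) for i in survivors}
    best = max(final, key=final.get)
    return best, sorted(final)

def denoising_steps_used(n, total_steps, steps_before_prune, keep_ratio):
    """Assumed cost model: every candidate runs `steps_before_prune` steps,
    and only the survivors run the remaining steps."""
    n_keep = max(1, int(round(keep_ratio * n)))
    return n * steps_before_prune + n_keep * (total_steps - steps_before_prune)

# Toy data: made-up rewards for four candidates.
intermediate = [0.1, 0.9, 0.5, 0.7]   # noise-aware estimates at the pruning step
clean = [0.2, 0.6, 0.4, 0.8]          # rewards after full denoising

best, kept = prune_then_select(intermediate, lambda i: clean[i], keep_ratio=0.5)
pruned_cost = denoising_steps_used(4, total_steps=10, steps_before_prune=3,
                                   keep_ratio=0.5)
best_of_n_cost = 4 * 10  # best-of-N fully denoises every candidate
print(best, kept, pruned_cost, best_of_n_cost)
```

In this toy setting, pruning finds the same best candidate as best-of-N while spending fewer denoising steps, so the saved steps could be used to start from more seeds. That only holds when the intermediate ranking agrees well enough with the clean-image ranking, which is the gap the noise-aware reward models are meant to close.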

Paper Structure

This paper contains 41 sections, 12 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of the best-of-$N$ TTS algorithm with our TTSnap algorithm under an equivalent compute budget. By applying early-stage pruning, TTSnap can explore a much larger pool of candidates (shown as polygons). Candidate colors indicate reward values from low (light yellow) to high (dark red). Each candidate is denoised from left to right under different initial noise. The magnifying glass represents the reward models, where the noise-aware variants are finetuned using our self-distillation strategy.
  • Figure 2: Noise-aware reward finetuning. Given a set of text prompts, we generate training data by sampling from the diffusion model and storing the intermediate Tweedie-estimated images. To train the noise-aware reward models, we introduce a curriculum self-distillation strategy that gradually shifts the training domain from clean images to increasingly noisy ones. After one epoch at each noise level, we save the model weights and proceed to the next, ensuring small domain gaps and stable, efficient training.
  • Figure 3: Rank consistency at different noise levels (noisier timesteps to the left), for three reward models, with and without our noise-aware reward finetuning (NARF) approach.
  • Figure 4: Analysis of the effect of different variables on the PickScore reward or number of explored samples $N$, with respect to (a) $N$, (b) the compute budget $B$, (c) the retention ratio $\alpha$, and (d) the pruning timestep $\tau$, compared with the baselines. Defaults of $N$, $B$, $\alpha$, $\tau$ are $35$, $4000$, $0.3$ and $6$, respectively.
  • Figure 5: The PickScore reward of the final image with respect to the budget $B$ for (a) TTSnap (w/ NARF), TTSp (w/o NARF), and best-of-$N$, and (b) TTSnap with and without NegToMe sample diversity enhancement.
  • ...and 4 more figures
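
The rank consistency plotted in Figure 3 compares how candidates are ordered by their intermediate rewards with how they are ordered by their final clean-image rewards. The exact measure used in the paper is not stated in this summary; the sketch below uses Spearman's rank correlation as one common, assumed choice. The helper names `ranks` and `spearman` and the reward values are illustrative only.

```python
import numpy as np

def ranks(values):
    """Rank of each entry (0 = smallest). Assumes there are no ties."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Made-up rewards for four candidates.
intermediate = [0.1, 0.9, 0.5, 0.7]   # predicted on a noisy intermediate estimate
clean = [0.2, 0.6, 0.4, 0.8]          # predicted on the final clean image

rho = spearman(intermediate, clean)
print(round(rho, 3))
```

A value near 1 means that pruning on intermediate rewards would rarely discard a candidate that ends up with a high final reward; a value near 0 means the intermediate ranking carries little information about the final one.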