Tiny Inference-Time Scaling with Latent Verifiers

Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Abstract

Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling, reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5% with respect to a standard MLLM verifier, while achieving a +2.7% improvement on GenEval at the same inference-time budget.
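
The selection scheme described in the abstract can be illustrated with a minimal Best-of-N sketch. This is not the authors' implementation: the functions `toy_partial_generator`, `toy_latent_verifier`, and `toy_finish_and_decode`, as well as the `WEIGHTS` vector, are hypothetical stand-ins invented for illustration. The sketch also assumes that only the selected candidate is run through the rest of the pipeline and decoded, which the abstract does not state explicitly.

```python
import random
from typing import Callable, List, Sequence, Tuple

Hidden = List[float]

# Hypothetical verifier weights; a real verifier would be a trained model.
WEIGHTS = [0.5, -0.25, 0.75, 0.1, -0.6, 0.3, 0.2, -0.4]


def toy_partial_generator(prompt: str, seed: int, dim: int = 8) -> Hidden:
    """Stand-in for running a generator only up to an intermediate layer."""
    rng = random.Random(f"{prompt}-{seed}")  # deterministic per (prompt, seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_latent_verifier(hidden: Hidden, weights: Sequence[float]) -> float:
    """Stand-in for a verifier that scores hidden states directly."""
    return sum(h * w for h, w in zip(hidden, weights))


def toy_finish_and_decode(hidden: Hidden) -> str:
    """Stand-in for the remaining layers plus autoencoder decoding."""
    return f"<image decoded from {len(hidden)} hidden features>"


def best_of_n(
    prompt: str,
    n: int,
    partial_generate: Callable[[str, int], Hidden],
    verify: Callable[[Hidden], float],
    finish_and_decode: Callable[[Hidden], str],
) -> Tuple[str, int, List[float]]:
    """Score n candidates in latent space, then decode only the best one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Step 1: produce hidden states for every candidate (no pixel decoding).
    candidates = [partial_generate(prompt, seed) for seed in range(n)]
    # Step 2: score each candidate directly from its hidden state.
    scores = [verify(hidden) for hidden in candidates]
    # Step 3: pay the decoding cost once, for the highest-scoring candidate.
    best = max(range(n), key=scores.__getitem__)
    return finish_and_decode(candidates[best]), best, scores


image, best_index, all_scores = best_of_n(
    "a red cube on a blue sphere",
    4,
    toy_partial_generator,
    lambda hidden: toy_latent_verifier(hidden, WEIGHTS),
    toy_finish_and_decode,
)
print(best_index, image)
```

In this sketch the decoding function is called once per prompt rather than once per candidate, which is the source of the savings the abstract attributes to verifying in latent space.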

Paper Structure (26 sections, 4 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: (A) Comparison between standard inference-time scaling and VHS. VHS skips part of the generation pipeline and avoids the decoding and re-encoding steps. (B) VHS achieves a comparable quality score on GenEval [ghosh2023geneval] in just 57% of the compute time.
  • Figure 2: Comparison between a standard generation-verification pipeline (top) and VHS (bottom). VHS consumes visual features directly from the hidden states of the generator, bypassing subsequent DiT layers, autoencoder (AE) decoding, and CLIP-based re-encoding, significantly reducing sampling and verification overhead.
  • Figure 3: Visual comparison of the images picked as best by different verifiers among GenEval-generated candidates.
  • Figure 4: Efficient Best-of-N pipeline with VHS.
  • Figure 5: Overall accuracy (%) of SANA-Sprint [chen2025sana] on GenEval [ghosh2023geneval] across time (seconds), TFLOPs, and VRAM usage (GB).
  • ...and 5 more figures