Tiny Inference-Time Scaling with Latent Verifiers

Davide Bucciarelli; Evelyn Turri; Lorenzo Baraldi; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara

Abstract

Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding each candidate to pixel space and re-encoding it into the visual embedding space, which leads to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while matching or improving on the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling: compared with a standard MLLM verifier, it reduces joint generation-and-verification time by 63.3%, FLOPs by 51%, and VRAM usage by 14.5%, and it achieves a +2.7% improvement on GenEval at the same inference-time budget.
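The score-and-select loop described in the abstract can be illustrated with a minimal Best-of-N sketch. This is not the authors' implementation: `sample_hidden`, `score_hidden`, and `finish` are hypothetical callables standing in for, respectively, a partial generator pass that stops at an intermediate hidden state, a latent verifier that scores that state, and the remaining generation and decoding steps. The point it illustrates is that only the selected candidate is decoded.

```python
from typing import Callable, List, Tuple, TypeVar

H = TypeVar("H")    # hidden-state type (hypothetical placeholder)
Img = TypeVar("Img")  # decoded-image type (hypothetical placeholder)


def best_of_n(
    prompt: str,
    n: int,
    sample_hidden: Callable[[str, int], H],    # assumed: partial generator pass for a seed
    score_hidden: Callable[[str, H], float],   # assumed: verifier score on hidden states
    finish: Callable[[H], Img],                # assumed: remaining layers + decoding
) -> Tuple[Img, int, List[float]]:
    """Score n candidates in latent space and decode only the best one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # One candidate per seed; nothing is decoded to pixel space here.
    candidates = [sample_hidden(prompt, seed) for seed in range(n)]
    scores = [score_hidden(prompt, h) for h in candidates]
    # Index of the highest-scoring candidate (first one wins on ties).
    best = max(range(n), key=scores.__getitem__)
    # The expensive completion step runs once, on the winner only.
    return finish(candidates[best]), best, scores
```

With a pixel-space verifier, `finish` would instead have to run on every candidate before scoring, which is the redundant cost the abstract refers to.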

Paper Structure (26 sections, 4 equations, 10 figures, 7 tables)

Introduction
Related Work
Proposed Method
Preliminaries
Latent Verifier
Training Procedure Overview
Experimental Results
Implementation Details
Experimental Setting
Latency Estimation of VHS
Performance on GenEval
Ablation Studies
Generalization to Other Generators
Conclusion
Additional Implementation Details
...and 11 more sections

Figures (10)

Figure 1: (A) Comparison between standard inference-time scaling and VHS. VHS skips part of the generation pipeline and avoids the decoding and re-encoding steps. (B) VHS achieves a comparable quality score on GenEval (Ghosh et al., 2023) in just 57% of the compute time.
Figure 2: Comparison between a standard generation-verification pipeline (top) and VHS (bottom). VHS consumes visual features directly from the hidden states of the generator, bypassing subsequent DiT layers, autoencoder (AE) decoding, and CLIP-based re-encoding, significantly reducing sampling and verification overhead.
Figure 3: Visual comparison of the images picked as best by different verifiers among candidates generated for GenEval prompts.
Figure 4: Efficient Best-of-N pipeline with VHS.
Figure 5: Overall accuracy (%) of SANA-Sprint (Chen et al., 2025) on GenEval (Ghosh et al., 2023) as a function of time (seconds), TFLOPs, and VRAM usage (GB).
...and 5 more figures
