Parallel Test-Time Scaling with Multi-Sequence Verifiers

Yegon Kim; Seungyoo Lee; Chaeyun Jang; Hyungi Lee; Juho Lee

TL;DR

The Multi-Sequence Verifier (MSV) is the first verifier designed to jointly process all candidate solutions and model their interactions. Its improved calibration directly enhances best-of-N selection performance.

Abstract

Parallel test-time scaling, which generates multiple candidate solutions for a single problem, is a powerful technique for improving large language model performance. However, it is hindered by two key bottlenecks: accurately selecting the correct solution from the candidate pool, and the high inference latency of generating many full solutions. We argue that both challenges are fundamentally linked to verifier calibration: a well-calibrated verifier not only improves answer selection but also enables early-stopping strategies that reduce latency. Existing verifiers, however, score each candidate in isolation and overlook the rich contextual information across the set of candidates. To address this, we introduce the Multi-Sequence Verifier (MSV), the first verifier designed to jointly process all candidate solutions and model their interactions. MSV achieves improved calibration, which directly enhances best-of-N selection performance. We further introduce a streaming MSV variant that enables a new early-stopping framework. This framework fully leverages parallel decoding, in contrast to existing multi-sequence early-exit methods, which decode sequences one by one and thus incur significant latency. In this setting, MSV reaches the same target accuracy with around half the latency required by its counterpart that scores each solution in isolation.
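To make the terms in the abstract concrete, the following is a minimal sketch of best-of-N selection, the Brier score as a calibration measure, and a threshold-based stopping check. It is not the MSV model or the paper's early-stopping framework: the function names, the example scores, and in particular the "stop once any candidate clears a threshold" rule are illustrative assumptions. The verifier scores are assumed to be given as per-candidate correctness probabilities.

```python
from typing import Sequence


def best_of_n(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Return the candidate answer with the highest verifier score."""
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and of equal length")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return answers[best_index]


def brier_score(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels.

    Lower is better.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(scores, labels)) / len(scores)


def should_stop_early(scores: Sequence[float], threshold: float = 0.9) -> bool:
    """Hypothetical stopping rule (an assumption, not the paper's method):
    stop decoding once any candidate's score reaches the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return max(scores) >= threshold


if __name__ == "__main__":
    # Made-up example: four candidate answers with verifier probabilities.
    answers = ["12", "15", "12", "7"]
    scores = [0.2, 0.7, 0.3, 0.1]
    labels = [0, 1, 0, 0]  # 1 marks a correct answer

    print(best_of_n(answers, scores))
    print(brier_score(scores, labels))
    print(should_stop_early(scores, threshold=0.9))
```

The sketch illustrates why calibration matters for both bottlenecks named in the abstract: selection uses only the ranking of scores, whereas a threshold rule depends on the scores being trustworthy as probabilities.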

Paper Structure (69 sections, 23 equations, 8 figures, 36 tables, 1 algorithm)

Introduction
Related Work
Calibration of LLM Outputs
Parallel Test-Time Scaling
Adaptive Scaling for Efficiency
Method
Problem Setup
Terminal Answers.
Streaming Answers.
Correctness and learning objectives.
Multi-Sequence Verifier
Input representation for .
Multi-mask transformer blocks.
Feature augmentation.
Constructing final predictions.
...and 54 more sections

Figures (8)

Figure 1: Illustration of our Multi-Sequence Verifier (MSV), which uses multiple attention masks in its Multi-Mask Transformer Block (MMTB) to predict the correctness of each answer. The different attention masks allow MSV to flexibly leverage information both across and within sequences.
Figure 2: Brier scores ($\downarrow$) of baselines and MSV$_N$ in the Terminal Answers setting. OB refers to OlympiadBench, and OM refers to Omni-MATH. Full results are in \ref{tab:ta_auroc_brier}.
Figure 3: Best-of-N accuracy vs. $N$ in the Terminal Answers setting. Full results are in \ref{tab:acc_bon_only_reordered_terminal}.
Figure 4: Probability estimates for a single problem with 16 completions, comparing MSV$_1$, MSV$_1$+WV, and MSV$_{16}$.
Figure 5: Reliability diagrams of verifier confidence on best-of-64 answers (AIME).
...and 3 more figures
