Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks
Davide Romano, Jonathan Schwarz, Daniele Giofré
TL;DR
This study investigates verifier-based test-time scaling (TTS) for legal reasoning in multiple-choice QA (MCQA), evaluating Best-of-$N$ and Diverse Verifier Tree Search under realistic low-$N$ budgets across five benchmarks. It systematically compares outcome reward models (ORMs) and process reward models (PRMs) across generator sizes (e.g., 70B) and domains, highlighting the roles of domain specialization and model scale. The key finding is that verifier-based TTS often yields limited gains for strong, general-purpose generators, but can offer substantial improvements on high-cardinality tasks when using large, legally specialized verifiers and PRMs; moreover, PRMs demonstrate cross-role robustness. Practically, the results suggest prioritizing high-quality, in-domain reward models for inference-time legal reasoning, while recognizing diminishing returns as generator capability increases and the need for broader domain coverage in future work.
Abstract
Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
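To make the outcome-level setting concrete, the following is a minimal sketch of Best-of-$N$ selection for MCQA, not the authors' implementation. It assumes two hypothetical callables that do not come from the paper: `generate`, which samples one (reasoning, answer label) pair from a generator model, and `score`, which returns a scalar verifier score for a candidate. The answer of the highest-scoring candidate is returned.

```python
from typing import Callable, List, Tuple

# A candidate is (reasoning text, chosen option label such as "A").
Candidate = Tuple[str, str]


def best_of_n(
    question: str,
    generate: Callable[[str], Candidate],     # hypothetical generator sampler
    score: Callable[[str, str, str], float],  # hypothetical verifier (e.g. an ORM)
    n: int = 4,
) -> str:
    """Sample n candidates and return the answer the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")

    # Draw n independent candidates from the generator.
    candidates: List[Candidate] = [generate(question) for _ in range(n)]

    # Score each candidate as a whole (outcome-level verification).
    scores = [score(question, reasoning, answer) for reasoning, answer in candidates]

    # Keep the candidate with the highest score; ties go to the earliest sample.
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index][1]
```

In this sketch the inference cost grows linearly with `n` (one generator call and one verifier call per candidate), which is why low-$N$ budgets matter in practice. Process-level methods such as tree search instead score intermediate reasoning steps and are not covered here.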
