Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks

Davide Romano, Jonathan Schwarz, Daniele Giofré

TL;DR

This study investigates verifier-based test-time scaling (TTS) for legal multiple-choice QA (MCQA), evaluating Best-of-$N$ and Diverse Verifier Tree Search under realistic low-$N$ budgets across five benchmarks. It systematically compares outcome reward models (ORMs) and process reward models (PRMs) across generator sizes (e.g., generators up to 70B) and domains, highlighting the roles of domain specialization and model scale. The key finding is that verifier-based TTS often yields limited gains for strong, general-purpose generators, but can offer substantial improvements on high-cardinality tasks when using large, legally specialized verifiers and PRMs; moreover, PRMs demonstrate cross-role robustness. Practically, the results suggest prioritizing high-quality, in-domain rewards for inference-time legal reasoning, while recognizing diminishing returns as generator power increases and the need for broader domain coverage in future work.

Abstract

Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
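As a rough illustration of outcome-level verification, the following minimal sketch shows Best-of-$N$ selection for MCQA next to a plain majority vote. It is not the paper's implementation: the `score` callable is a hypothetical stand-in for a reward model, and the candidate strings and scores are made-up toy values.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# A candidate is (reasoning_text, answer_letter), e.g. sampled N times
# from a generator. Both fields are illustrative assumptions.
Candidate = Tuple[str, str]


def best_of_n(candidates: Sequence[Candidate],
              score: Callable[[str], float]) -> str:
    """Return the answer of the candidate whose reasoning scores highest.

    `score` is a hypothetical stand-in for an outcome reward model.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    _, best_answer = max(candidates, key=lambda c: score(c[0]))
    return best_answer


def majority_vote(candidates: Sequence[Candidate]) -> str:
    """Verifier-free baseline: return the most frequent answer."""
    if not candidates:
        raise ValueError("need at least one candidate")
    counts = Counter(answer for _, answer in candidates)
    return counts.most_common(1)[0][0]


# Toy data with N = 4 samples; the scores are invented for illustration.
samples = [("r1", "A"), ("r2", "B"), ("r3", "A"), ("r4", "C")]
toy_scores = {"r1": 0.2, "r2": 0.9, "r3": 0.4, "r4": 0.1}

verifier_choice = best_of_n(samples, toy_scores.__getitem__)
vote_choice = majority_vote(samples)
```

In this toy case the verifier picks the single highest-scored sample while the vote picks the most common answer, so the two strategies can disagree; whether the verifier's choice is better depends entirely on the quality of the reward model.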

Paper Structure

This paper contains 25 sections, 4 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: TTS with Llama-3.1-8B-Instruct with four different verifiers from N=4 to N=16, average over 5 legal MCQA benchmarks
  • Figure 2: RQ1 results across all benchmarks with Llama 8B as the generator
  • Figure 3: RQ2 average results with both Llama 8B and Llama 70B
  • Figure 4: RQ3 average results with both Llama 8B and Llama 70B
  • Figure 5: RQ1 average and individual benchmarks results using Llama-3.2-3B-Instruct
  • ...and 11 more figures