Tool Verification for Test-Time Reinforcement Learning

Ruotong Liao; Nikolai Röhrich; Xiaohan Wang; Yuhui Zhang; Yasaman Samadzadeh; Volker Tresp; Serena Yeung-Levy

TL;DR

T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.

Abstract

Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards obtained through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased, self-reinforcing reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses evidence from an external tool (e.g., code execution) to upweight verified rollouts in verification-aware voting, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
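The verification-aware voting described above can be illustrated with a minimal sketch. It follows the description in the Figure 2 caption (verified rollouts receive extra vote mass, unverified rollouts receive a unit vote, and rewards are binary agreement with the pseudo-label), but it is not the authors' implementation: the function name `weighted_vote`, the fixed weight `w = 2.0`, the tie-breaking rule, and the toy answers are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_vote(answers, verified, w=2.0):
    """Return (pseudo_label, rewards) for the rollouts of one prompt.

    answers  : final answer extracted from each rollout
    verified : validity flag per rollout (True if tool evidence agrees)
    w        : vote mass of a verified rollout (hypothetical value);
               unverified rollouts always count as 1.0
    """
    mass = defaultdict(float)
    for answer, is_verified in zip(answers, verified):
        mass[answer] += w if is_verified else 1.0

    # Pick the answer with the largest total vote mass.
    # Ties go to the answer seen first (an arbitrary choice for this sketch).
    pseudo_label = max(mass, key=mass.get)

    # Binary reward: 1 if the rollout agrees with the pseudo-label, else 0.
    rewards = [1 if answer == pseudo_label else 0 for answer in answers]
    return pseudo_label, rewards


# Toy case: a popular but unverified answer versus a verified minority answer.
answers = ["12", "12", "12", "15", "15"]
verified = [False, False, False, True, True]

# With w = 1.0 the scheme reduces to plain majority voting.
plain_label, _ = weighted_vote(answers, verified, w=1.0)
t3_label, t3_rewards = weighted_vote(answers, verified, w=2.0)
```

In this toy case plain majority voting selects the unverified answer "12" (3 votes against 2), while the weighted vote selects "15" (vote mass 4 against 3), so only the verified rollouts are rewarded.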

Paper Structure (49 sections, 11 equations, 11 figures, 3 tables)

Introduction
Related Works
Verification for Test Time Scaling
Test-Time Training
The Failure Mode: How Unverified Consensus Induces Reward Bias
Test Time Reinforcement Learning
Spurious Majority as a Biased Pseudo-Label
Self-consensus can estimate wrong labels.
Self-reinforcing feedback loop and incorrect mode collapse.
Method: Tool Verification for Test Time Reinforcement Learning
Verifier
Verifier.
Verification Tool
Tool execution as external evidence.
Verification Weight
...and 34 more sections

Figures (11)

Figure 1: The concept of T3RL. Top: Majority-vote pseudo-labels can be spurious. T3RL introduces verification to suppress false-popular pseudo-labels. Bottom: T3RL introduces test-time verification into self-evolution via tool-executed evidence (e.g., a code interpreter) to stabilize training with verified rollouts. Right: T3RL achieves consistent gains, yielding evidence-grounded self-evolution.
Figure 2: T3RL: Tool Verification for Test-Time Reinforcement Learning. Verifier: an LLM verifier parses each sampled rollout $y_i$ into an answer $\hat{a}_i$ and examines the returned execution result $a_i$, yielding a validity flag $v_i$ for each rollout. Tool verification: the verifier compiles the rollout’s claimed computations into lightweight Python and queries a code interpreter to obtain executable evidence of $a_i$. Verification-weighted majority voting: a verification-aware pseudo-label $\tilde{y}^{*}$ is formed in which verified rollouts receive vote mass $w_i$ and unverified rollouts receive a unit vote; binary rewards $r^v_i=\mathbb{1}[a_i=\tilde{y}^{*}]$ are then assigned for test-time RL updates.
Figure 3: Spurious reward in reinforced cycle of TTRL.
Figure 4: Relative gain over baseline trend (T3RL vs TTRL).
Figure 5: Ablation on verifier and verification tool. Left: Adding an LLM verifier improves TTRL even without tool execution. Right: Code execution significantly strengthens verification.
...and 6 more figures
