Tool Verification for Test-Time Reinforcement Learning

Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy

TL;DR

T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.

Abstract

Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses evidence from an external tool (e.g., code execution) to upweight verified rollouts in verification-aware voting, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.

Paper Structure (49 sections, 11 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: The concept of T3RL. Top: Majority-vote pseudo-labels can be spurious. T3RL introduces verification to suppress false-popular pseudo-labels. Bottom: T3RL introduces test-time verification into self-evolution via tool-executed evidence (e.g., a code interpreter) to stabilize training with verified rollouts. Right: T3RL achieves consistent gains, yielding evidence-grounded self-evolution.
  • Figure 2: T3RL: Tool Verification for Test-Time Reinforcement Learning. Verifier: an LLM verifier parses each sampled rollout $y_i$ into an answer $\hat{a}_i$ and examines the returned execution result $a_i$, yielding a validity flag $v_i$ for each rollout. Tool verification: the verifier compiles the rollout’s claimed computations into lightweight Python and queries a code interpreter to obtain executable evidence of $a_i$. Verification-weighted majority voting: a verification-aware pseudo-label $\tilde{y}^{*}$ is formed in which verified rollouts receive vote mass $w_i$ and unverified rollouts receive a unit vote; binary rewards $r^v_i=\mathbbm{1}[a_i=\tilde{y}^{*}]$ are then assigned for test-time RL updates.
  • Figure 3: Spurious reward in the reinforced cycle of TTRL.
  • Figure 4: Relative gain over baseline trend (T3RL vs TTRL).
  • Figure 5: Ablation on verifier and verification tool. Left: Adding an LLM verifier improves TTRL even without tool execution. Right: Code execution significantly strengthens verification.
  • ...and 6 more figures
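
The verification-weighted majority voting described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `weighted_vote`, the single shared weight `w` (the caption uses a per-rollout $w_i$), the value `w=2.0`, and the toy answers are all assumptions chosen for illustration. Each verified rollout contributes `w` vote mass, each unverified rollout contributes a unit vote, and every rollout receives a binary reward for matching the resulting pseudo-label.

```python
from collections import defaultdict


def weighted_vote(answers, verified, w=2.0):
    """Return (pseudo_label, rewards) under verification-weighted voting.

    answers  : list of final answers, one per rollout
    verified : list of bools, the validity flag v_i of each rollout
    w        : vote mass for a verified rollout (hypothetical value)
    """
    mass = defaultdict(float)
    for answer, is_verified in zip(answers, verified):
        # Verified rollouts get w vote mass; unverified ones get a unit vote.
        mass[answer] += w if is_verified else 1.0

    # The pseudo-label is the answer with the largest total vote mass.
    pseudo_label = max(mass, key=mass.get)

    # Binary reward: 1 if the rollout's answer matches the pseudo-label.
    rewards = [1 if answer == pseudo_label else 0 for answer in answers]
    return pseudo_label, rewards


# Toy example: an unverified answer holds the plain majority (3 of 5),
# but the two verified rollouts outweigh it (2 * 2.0 = 4.0 > 3.0).
answers = ["12", "12", "12", "15", "15"]
verified = [False, False, False, True, True]
label, rewards = weighted_vote(answers, verified, w=2.0)
print(label, rewards)
```

In this toy case plain majority voting would select "12", whereas the weighted vote selects "15", which is the kind of false-popular pseudo-label the verification step is meant to suppress. How large `w` must be to overturn an unverified majority depends on how many rollouts are verified, so the value used here should not be read as a recommended setting.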