What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan

Abstract

Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper-Yan/SCRL.
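
The failure mode described above can be illustrated with a minimal Python sketch. It contrasts plain majority-vote pseudo-rewards with a selective variant that abstains when the vote share is too low. The function names, the binary reward values, and the `min_ratio` threshold are hypothetical choices for illustration; the abstract does not specify SCRL's actual consensus criteria.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Plain majority voting: the most common answer becomes the pseudo-label.

    Every rollout that matches it gets reward 1.0, all others 0.0,
    regardless of how weak the majority is.
    """
    label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == label else 0.0 for a in answers]


def selective_positive_label(answers, min_ratio=0.5):
    """Return the majority answer only if its vote share reaches min_ratio.

    min_ratio is an assumed threshold, not a value from the paper.
    Returns None (abstain) when consensus is too weak.
    """
    label, count = Counter(answers).most_common(1)[0]
    return label if count / len(answers) >= min_ratio else None


# Dispersed answers: "12" wins with only 2 of 8 votes.
dispersed = ["12", "7", "12", "9", "3", "5", "8", "1"]
weak_rewards = majority_vote_rewards(dispersed)
weak_label = selective_positive_label(dispersed)

# Concentrated answers: "4" wins with 6 of 8 votes.
concentrated = ["4"] * 6 + ["5", "6"]
strong_label = selective_positive_label(concentrated)
```

In the dispersed case, plain voting still rewards the two rollouts that answered "12", while the selective variant abstains; in the concentrated case the selective variant accepts "4".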

Paper Structure (38 sections, 9 equations, 5 figures, 9 tables, 1 algorithm)

Figures (5)

  • Figure 1: Comparison of pseudo-labeling strategies under weak consensus. (a) Majority voting assigns the positive label despite a dispersed answer distribution. (b) SCRL abstains from positive labeling when consensus is insufficient and identifies negative labels instead.
  • Figure 2: Overview of the SCRL framework. SCRL addresses test-time label noise through three components: selective positive pseudo-labeling enforces strict consensus thresholds to prevent reinforcing unreliable majorities; entropy-gated negative pseudo-labeling identifies negative labels by isolating answers that are both rare and highly uncertain, pruning the search space without eliminating valid candidates; dynamic reward shaping constructs distribution-aware rewards that scale with consensus strength and penalize high-uncertainty trajectories.
  • Figure 3: Statistics of positive and negative pseudo-label estimation on the AMC dataset using Qwen2.5-3B.
  • Figure 4: Training dynamics of SCRL and TTRL on Qwen2.5-3B across three mathematical benchmarks.
  • Figure 5: Training dynamics of SCRL and TTRL on Qwen2.5-Math-7B across three mathematical benchmarks.
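
The entropy-gated negative pseudo-labeling described in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: it assumes each rollout comes with a scalar uncertainty score (here called a mean token entropy), and the `max_ratio` and `min_entropy` thresholds and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def entropy_gated_negatives(answers, entropies, max_ratio=0.2, min_entropy=1.0):
    """Flag answers that are both rare and generated with high uncertainty.

    answers[i]   : final answer extracted from rollout i
    entropies[i] : assumed per-rollout uncertainty score (e.g. mean token entropy)
    max_ratio    : assumed upper bound on vote share for an answer to count as rare
    min_entropy  : assumed lower bound on mean entropy for an answer to count as uncertain
    """
    counts = Counter(answers)
    entropy_sums = defaultdict(float)
    for answer, entropy in zip(answers, entropies):
        entropy_sums[answer] += entropy

    total = len(answers)
    return {
        answer
        for answer, count in counts.items()
        if count / total <= max_ratio and entropy_sums[answer] / count >= min_entropy
    }


answers = ["4", "4", "4", "4", "5", "6"]
entropies = [0.2, 0.3, 0.2, 0.3, 1.5, 0.4]
negatives = entropy_gated_negatives(answers, entropies)
```

Here "5" is flagged because it is rare and its rollout was uncertain, while "6" is equally rare but confidently generated, so it is kept as a possible valid candidate. Requiring both conditions is what keeps the pruning from eliminating rare but plausible answers.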