Dual Consensus: Escaping from Spurious Majority in Unsupervised RLVR via Two-Stage Vote Mechanism

Kaixuan Du, Meng Cao, Hang Zhang, Yukun Wang, Xiangzhou Huang, Ni Li

Abstract

Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have proven effective at improving LLM performance on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and tend to converge on spurious yet popular answers, becoming trapped in a dominant mode that limits further improvement. To address this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method that generates more reliable learning signals through a two-stage consensus mechanism. The model first acts as an anchor, producing dominant responses; it then serves as an explorer, generating diverse auxiliary signals via a temporary unlearning process. The final training target is derived from the harmonic mean of these two signal sets. Notably, the process operates entirely without external models or supervision. Across eight benchmarks and diverse domains, DCRL consistently improves Pass@1 over majority vote while yielding more stable training dynamics. These results demonstrate that DCRL establishes a scalable path toward stronger reasoning without labels.
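
To make the two-stage vote concrete, the following is a minimal sketch of how a harmonic mean could combine anchor and explorer votes into a pseudo-label. The abstract states only that the training target is derived from the harmonic mean of the two signal sets; the use of per-answer vote shares, the function names (`vote_shares`, `harmonic_mean`, `dual_consensus_label`), and the argmax selection are assumptions made for illustration, not the authors' implementation. The temporary unlearning step is not modeled here; the explorer's answers are simply taken as given.

```python
from collections import Counter


def vote_shares(answers):
    """Return the fraction of samples that produced each distinct answer."""
    counts = Counter(answers)
    total = len(answers)
    return {ans: n / total for ans, n in counts.items()}


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if either is 0."""
    return 0.0 if a == 0 or b == 0 else 2 * a * b / (a + b)


def dual_consensus_label(anchor_answers, explorer_answers):
    """Pick the answer with the highest harmonic mean of the two vote shares.

    Assumption: each "signal set" is summarized as a vote share per answer.
    An answer must be supported by BOTH roles to receive a non-zero score.
    """
    anchor = vote_shares(anchor_answers)
    explorer = vote_shares(explorer_answers)
    candidates = set(anchor) | set(explorer)
    scores = {
        c: harmonic_mean(anchor.get(c, 0.0), explorer.get(c, 0.0))
        for c in candidates
    }
    # Sorting first makes tie-breaking deterministic.
    best = max(sorted(scores), key=scores.get)
    return best, scores


# Hypothetical sampled answers for a single prompt.
anchor_answers = ["12"] * 5 + ["15"] * 3
explorer_answers = ["12"] * 1 + ["15"] * 5 + ["9"] * 2

label, scores = dual_consensus_label(anchor_answers, explorer_answers)
majority = Counter(anchor_answers).most_common(1)[0][0]
print("anchor majority:", majority)
print("dual consensus:", label)
```

In this toy case the anchor's plain majority vote selects "12", whereas the harmonic-mean score selects "15", because "15" has substantial support under both roles while "12" loses most of its support in the explorer's samples. The harmonic mean is dominated by the smaller of its two inputs, which is why an answer popular in only one set scores low.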

Paper Structure (41 sections, 21 equations, 8 figures, 6 tables, 1 algorithm)

Figures (8)

  • Figure 1: An overview of Dual Consensus Reinforcement Learning (DCRL). Specifically, the policy model assumes two roles: (1) an anchor that generates dominant and reliable responses; (2) an explorer that produces diverse auxiliary signals through a temporary unlearning process.
  • Figure 2: Output distributions of the Anchor and Explorer models. The Explorer model, after the unlearning process, generates a more diverse distribution.
  • Figure 3: Comparison between majority vote and Dual Consensus. Majority vote tends to fall into spurious consensus by over-relying on dominant but potentially incorrect response modes. Dual Consensus mitigates this by converting the anchor model, which captures dominant reasoning patterns, into an explorer model via temporary unlearning. The explorer surfaces diverse alternative response modes, so the framework balances the reliability of the current dominant patterns against the diversity of potentially valid alternatives and selects answers more accurately.
  • Figure 4: Training dynamics of Dual Consensus on Qwen3-8B-Base. DCRL-Anchor in the label-accuracy subfigure refers to the majority vote of the anchor model in DCRL.
  • Figure 5: Comparison of reward signals between majority vote and Dual Consensus.
  • ...and 3 more figures