DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

Chi-Min Chan, Ehsan Hajiramezanali, Xiner Li, Edward De Brouwer, Carl Edwards, Wei Xue, Sirui Han, Yike Guo, Gabriele Scalia

TL;DR

DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, indicating that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.

Abstract

In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
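
The stratification idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the majority-vote definition of self-consensus, the cosine k-nearest-neighbor definition of neighborhood consensus, and both thresholds are assumptions made purely for illustration.

```python
import numpy as np

def self_consensus(weak_labels):
    """Majority label and agreement rate per step (assumed SC definition).

    weak_labels: (n_steps, n_supervisors) array of 0/1 weak step labels.
    """
    pos_frac = weak_labels.mean(axis=1)
    majority = (pos_frac >= 0.5).astype(int)
    sc = np.maximum(pos_frac, 1.0 - pos_frac)  # share of supervisors agreeing with the majority
    return majority, sc

def neighborhood_consensus(embeddings, majority, k=2):
    """Share of the k nearest neighbors (cosine) carrying the same majority label (assumed NC definition)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a step is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    return (majority[neighbors] == majority[:, None]).mean(axis=1)

def stratify(sc, nc, sc_thresh=0.8, nc_thresh=0.6):
    """Intersect the two consensus signals into four regimes (hypothetical thresholds)."""
    regimes = []
    for high_sc, high_nc in zip(sc >= sc_thresh, nc >= nc_thresh):
        if high_sc and high_nc:
            regimes.append("both")
        elif high_sc:
            regimes.append("sc_only")
        elif high_nc:
            regimes.append("nc_only")
        else:
            regimes.append("neither")
    return regimes

# Toy data: six reasoning steps, three weak supervisors, 2-D embeddings in two clusters.
weak_labels = np.array([
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],  # supervisors disagree
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 1],  # unanimous, but embedded among steps labeled 0
])
embeddings = np.array([
    [1.0, 0.0], [1.0, 0.1], [1.0, -0.1],
    [0.0, 1.0], [0.1, 1.0], [-0.1, 1.0],
])

majority, sc = self_consensus(weak_labels)
nc = neighborhood_consensus(embeddings, majority, k=2)
regimes = stratify(sc, nc)
print(list(zip(majority.tolist(), sc.round(2).tolist(), nc.tolist(), regimes)))
```

In this toy setup the last step is unanimous among supervisors yet disagrees with all of its neighbors, which is the kind of case a single consensus signal would miss. How each regime is then sampled or masked during training is left out here, since the abstract does not specify it.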

Paper Structure

This paper contains 63 sections, 4 theorems, 54 equations, 19 figures, 9 tables, and 1 algorithm.

Key Result

Theorem 3.4

Assume the calibration assumption holds and that $(\mathcal{D},\mathcal{N})$ satisfies $(c,q,\eta)$-soft robust expansion. Let $\bar{\rho}_\eta := \Pr\!(z_t\notin R_\eta(f_\tau))$ and $c' := \frac{c}{(1-\alpha)+c\alpha}$. If [...] then for any $\tau$ such that $1-2c'\alpha>0$, [...]
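
The side condition $1-2c'\alpha>0$ can be unpacked with a short calculation. This is a sketch assuming $0<\alpha<1$ and $c>0$ (so the denominator of $c'$ is positive); the statement above does not give these ranges, and the numeric values are hypothetical.

```latex
\begin{align*}
1 - 2c'\alpha > 0
  &\iff \frac{2c\alpha}{(1-\alpha)+c\alpha} < 1 \\
  &\iff 2c\alpha < (1-\alpha) + c\alpha \\
  &\iff c < \frac{1-\alpha}{\alpha}.
\end{align*}

% Hypothetical values: c = 0.5, \alpha = 0.2, so the bound is c < 4.
\[
c' = \frac{0.5}{0.8 + 0.1} = \frac{5}{9} \approx 0.556,
\qquad
1 - 2c'\alpha = 1 - \frac{2}{9} = \frac{7}{9} \approx 0.778 > 0.
\]
```

Under these assumptions the condition is equivalent to an upper bound on $c$ that tightens as $\alpha$ grows.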

Figures (19)

  • Figure 1: Dual-Consensus weak-to-strong supervision framework for efficient PRM training in biological reasoning, where expert step-level verification is costly or unavailable.
  • Figure 2: Performance comparison across four cell lines. BoN (Ours w/ Full Set (Multi-Label)) achieves the highest average F1 scores, outperforming single-label and SFT baselines by effectively aggregating multiple weak supervisory signals. Note: RPE1 is OOD for PRM; SFT is trained with RPE1.
  • Figure 3: Benchmarking Performance and Efficiency of BoN (Ours w/ CellProfiler). Top (instance-level) and bottom (label-level) efficiency analyses show that comparable performance can be achieved with fewer training instances and fewer weak step labels, suggesting that DC-W2S is effective.
  • Figure 4: Effect of embedding choice. Performance on RPE1–DE for different label-pattern configurations across semantic and biologically grounded embedding spaces. Embeddings with richer biological structure (e.g., ESM, CellProfiler) yield larger gains from neighborhood-reliable supervision (P3).
  • Figure 5: The exact prompt template used for weak supervision generation via LLM-as-a-judge. The blue text indicates where the injected context differs across the three methods (LF-Direct, Context, Analogical).
  • ...and 14 more figures

Theorems & Definitions (13)

  • Definition 3.2: $\eta$-robust
  • Definition 3.3: Soft robust expansion
  • Theorem 3.4: Soft weak-label correction for PRM
  • Remark 3.5: Informal: effect of miscalibration
  • Remark 3.6: Informal: connection to BoN
  • Remark 3.7: Neighborhood label consistency
  • Lemma 5.1
  • Proof
  • Definition 5.2: Pointwise $\eta$-robust hypothesis class
  • Lemma 5.3: Robust complexity is controlled by standard complexity
  • ...and 3 more