Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision

Xiaohan He, Shiyang Feng, Songtao Huang, Lei Bai, Bin Wang, Bo Zhang

TL;DR

Sci-CoE presents a co-evolving framework where a single LLM functions as both solver and verifier to enhance scientific reasoning with limited supervision. The method bootstraps with anchored learning and then scales through unsupervised co-evolution guided by a geometric consensus reward that balances reliability and diversity of verification strategies. Empirical results on GPQA-Diamond, MMLU-Pro, and UGPhysics show consistent improvements and strong scalability with unlabeled data. The work demonstrates a path toward robust, self-evolving scientific reasoning systems, while acknowledging computational costs and parameter-size considerations.

Abstract

Large language models (LLMs) have demonstrated exceptional reasoning capabilities, and co-evolving paradigms have shown promising results in domains such as code and math. However, in scientific reasoning tasks, these models remain fragile due to unreliable solution evaluation and limited diversity in verification strategies. In this work, we propose Sci-CoE, a two-stage scientific co-evolving framework that enables models to self-evolve as both solver and verifier through a transition from sparse supervision to unsupervised learning. In the first stage, the model uses a small set of annotated data to establish fundamental correctness judgment anchors for the Verifier. In the second stage, we introduce a geometric reward mechanism that jointly considers consensus, reliability, and diversity, driving large-scale self-iteration on unlabeled data. Experiments on several general scientific benchmarks demonstrate that Sci-CoE enhances complex reasoning capabilities and exhibits strong scalability, facilitating the construction of more robust and diverse evaluation systems. Code is available at https://github.com/InternScience/Sci-CoE.
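To make the solver/verifier interplay concrete, here is a minimal Python sketch of consensus-based selection among candidate solutions. It is an illustration under stated assumptions, not the Sci-CoE reward: the boolean `pass_matrix`, the helper names `consensus_scores` and `best_of_n`, and the rule "pick the solution that passes the largest fraction of verification strategies" are all hypothetical choices made for this example.

```python
from typing import List


def consensus_scores(pass_matrix: List[List[bool]]) -> List[float]:
    """Fraction of verification strategies that each candidate solution passes.

    pass_matrix[i][j] is True when solution i passes strategy j
    (a hypothetical data layout assumed for this sketch).
    """
    return [sum(row) / len(row) for row in pass_matrix]


def best_of_n(solutions: List[str], pass_matrix: List[List[bool]]) -> str:
    """Return the candidate with the highest consensus score (first one on ties)."""
    scores = consensus_scores(pass_matrix)
    best_index = max(range(len(solutions)), key=lambda i: scores[i])
    return solutions[best_index]


# Toy data: three candidate answers checked by three verification strategies.
solutions = ["A", "C", "C"]
pass_matrix = [
    [True, False, False],  # passes 1 of 3 strategies
    [True, True, False],   # passes 2 of 3 strategies
    [True, True, True],    # passes all 3 strategies
]

scores = consensus_scores(pass_matrix)
chosen = best_of_n(solutions, pass_matrix)
print(scores, chosen)
```

A plain pass fraction treats every strategy as equally trustworthy; the abstract indicates that the actual reward additionally accounts for reliability and diversity, which this sketch does not model.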

Paper Structure (28 sections, 10 equations, 4 figures, 5 tables, 2 algorithms)

Figures (4)

  • Figure 1: Examples of Scientific Question, Generated Solution and Verification Strategies.
  • Figure 2: The overall pipeline of Sci-CoE.
  • Figure 3: Performance of the model at different stages of training on the GPQA-D evaluation data. The three lines show the average accuracy of the generated solutions, the average accuracy of the generated verification strategies, and the Best-of-N (BoN) accuracy, using 16 generated solutions and 16 generated strategies. The baseline is Qwen2.5-7B-Instruct; the other model names give the number of training steps in that stage.
  • Figure 4: Visualization of the geometric reward and quantitative analysis of verification strategies. (a)-(d) display PCA projections of strategy embeddings in a polar coordinate system at different training stages, corresponding respectively to the Baseline Model, Stage 1 only, Stage 2 with Naive Consensus Reward, and Stage 2 with Geometric Reward. The angular distribution of points indicates diversity, while the radial distance to the cluster center represents strategy reliability, with points closer to the center indicating more stable strategies. Color represents the consistency score, with greener points indicating higher scores. (e)-(g) show the mean consistency, reliability, and diversity reward scores of different models.
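The polar view described for Figure 4 can be sketched in a few lines of Python. This is a rough illustration rather than the authors' plotting code: it assumes the strategy embeddings have already been projected to 2D (for example by PCA), and the helper names `to_polar`, `mean_radius`, and `circular_spread`, as well as the use of these two summary statistics as proxies for reliability and diversity, are assumptions made for this example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def to_polar(points: List[Point]) -> List[Tuple[float, float]]:
    """Convert 2D points to (radius, angle) pairs measured from their centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(math.hypot(x - cx, y - cy), math.atan2(y - cy, x - cx))
            for x, y in points]


def mean_radius(polar: List[Tuple[float, float]]) -> float:
    """Average distance to the centroid; smaller values mean a tighter cluster."""
    return sum(r for r, _ in polar) / len(polar)


def circular_spread(polar: List[Tuple[float, float]]) -> float:
    """One minus the mean resultant length of the angles.

    Close to 0 when all angles coincide, close to 1 when they are
    spread evenly around the circle.
    """
    c = sum(math.cos(a) for _, a in polar) / len(polar)
    s = sum(math.sin(a) for _, a in polar) / len(polar)
    return 1.0 - math.hypot(c, s)


# Toy 2D projections of four strategy embeddings (hypothetical values).
points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
polar = to_polar(points)
radius = mean_radius(polar)
spread = circular_spread(polar)
print(radius, spread)
```

In this toy case the four points sit at equal distance from the centroid and at evenly spaced angles, so the mean radius is 1 and the circular spread is essentially 1.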