Weak-to-Strong Reasoning

Yuqing Yang, Yan Ma, Pengfei Liu

TL;DR

The paper tackles the challenge of supervising superhuman LLMs by proposing a two-stage weak-to-strong reasoning framework in which a strong model self-refines its training data starting from a small high-quality set and then learns from weakly supervised signals via preference optimization. Stage I focuses on positive data selection and fine-tuning (including Weak-ICL variants and iterative approaches) to recover reasoning capabilities, while Stage II uses negative samples and DPO/ORPO-based contrastive learning to teach the model to avoid weak-model errors. Across GSM8K, MATH, and OlympicArena, the approach yields substantial gains over naive full weak fine-tuning, with strong improvements in greedy decoding and pass@k metrics. The work demonstrates that self-directed data curation and preference-guided learning can significantly amplify AI reasoning abilities without external ground-truth annotations, offering a scalable path toward robust open-ended reasoning in future AI systems.
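
For reference, the standard DPO objective that Stage II builds on can be written as below. This is the generic form from the DPO literature, not an equation reproduced from this paper; reading $y_w$ as the preferred solution, $y_l$ as the rejected solution (for example, a weak-model error), and $\pi_\text{ref}$ as the Stage I model are assumptions made for illustration.

```latex
\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}
      \right)
    \right]
```

Here $x$ is the question, $\sigma$ is the logistic function, and $\beta$ controls how far the trained policy $\pi_\theta$ may drift from the reference policy.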

Abstract

When large language models (LLMs) exceed human-level capabilities, it becomes increasingly challenging to provide full-scale and accurate supervision for these models. Weak-to-strong learning, which leverages a less capable model to unlock the latent abilities of a stronger model, proves valuable in this context. Yet, the efficacy of this approach for complex reasoning tasks is still untested. Furthermore, tackling reasoning tasks under the weak-to-strong setting currently lacks efficient methods to avoid blindly imitating the weak supervisor including its errors. In this paper, we introduce a progressive learning framework that enables the strong model to autonomously refine its training data, without requiring input from either a more advanced model or human-annotated data. This framework begins with supervised fine-tuning on a selective small but high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Extensive experiments on the GSM8K and MATH datasets demonstrate that our method significantly enhances the reasoning capabilities of Llama2-70b using three separate weak models. This method is further validated in a forward-looking experimental setup, where Llama3-8b-instruct effectively supervises Llama3-70b on the highly challenging OlympicArena dataset. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning powers. All relevant code and resources are available at https://github.com/GAIR-NLP/weak-to-strong-reasoning.
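
The "selective small but high-quality dataset" can be illustrated with a final-answer consistency filter, the criterion named in the Figure 3 caption. The following is a minimal sketch, not the authors' implementation: the `Sample` type, the `extract_final_answer` helper, and the "answer is" marker are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    weak_solution: str  # solution written by the weak model
    icl_solution: str   # solution from the strong model prompted with weak demonstrations


def extract_final_answer(solution: str) -> str:
    """Hypothetical extractor: return the text after the last 'answer is' marker."""
    marker = "answer is"
    idx = solution.lower().rfind(marker)
    tail = solution[idx + len(marker):] if idx != -1 else solution
    return tail.strip().rstrip(".").strip()


def select_consistent(samples):
    """Keep only samples whose two solutions agree on the final answer.

    No ground-truth label is consulted: agreement between the two sources
    is used as a proxy for correctness (an assumption of this sketch).
    """
    kept = []
    for s in samples:
        weak_answer = extract_final_answer(s.weak_solution)
        icl_answer = extract_final_answer(s.icl_solution)
        if weak_answer and weak_answer == icl_answer:
            kept.append(s)
    return kept
```

In this sketch, the retained samples would form the fine-tuning set for the strong model, while the discarded ones remain available for later stages.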

Paper Structure (43 sections, 1 equation, 7 figures, 14 tables)

Figures (7)

  • Figure 1: (a): Test accuracy on GSM8K using Llama2-7b to supervise Llama2-70b. (b): Test accuracy on OlympicArena using Llama3-8b-instruct to supervise Llama3-70b. "Weak Floor" refers to the results of the weak model. "Full Weak FT" refers to the results of the baseline where the strong model is naively fine-tuned on the full dataset generated by the weak model. "Our Stage I" represents the results from the first stage of supervised fine-tuning using our proposed weak-to-strong method. Note that our method in Stage I produces three variants of enhanced strong models and we present the best results here. "Our Stage II" denotes the results from the second stage of preference optimization using our method.
  • Figure 2: Illustration of weak-to-strong reasoning through the strong model self-refining its training data.
  • Figure 3: Overview of our method evolving from $\mathcal{M} \to \mathcal{M}_\text{plus} \to \mathcal{M}_\text{pro}$. Left: we utilize final answer consistency to selectively filter weak and icl data from diverse sources, which is used to fine-tune the strong model $\mathcal{M}$ and obtain $\mathcal{M}_\text{plus}$ with enhanced mathematical reasoning capabilities. Right: we leverage the confidence of $\mathcal{M}_\text{plus}$ to identify contrastive samples for performance optimization, resulting in a more robust strong model $\mathcal{M}_\text{pro}$.
  • Figure 4: Main results of Stage I. "Iter. 0" presents the performance of two baselines, where "weak" indicates full weak fine-tuning, i.e., naively fine-tuning on the entire weak data, and "icl" refers to weak ICL without fine-tuning. Models connected by a line share the same training data sources. Results below "strong ceiling" present test accuracy via greedy decoding, while those above show pass@k scores ($k=10$ and $\text{temperature}=1.0$). For simplicity, we only present the pass@k scores of $\mathcal{M}_\text{hybrid-ft}$ and checkpoints that surpass it using greedy decoding; full results are provided in the section on pass@k results (sec:passk).
  • Figure 5: Results on GSM8K supervised by Gemma-2b. and are under original demonstrations, and and are under carefully selected demonstrations.
  • ...and 2 more figures
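
The pass@k scores in Figure 4 ($k=10$, $\text{temperature}=1.0$) can be read as the fraction of questions for which at least one of the $k$ sampled solutions is correct. The following is a minimal sketch under that assumption; the function name and input layout are illustrative and are not taken from the paper's code.

```python
def pass_at_k(correct_flags_per_question):
    """Empirical pass@k over a test set.

    correct_flags_per_question: one list per question, holding k booleans
    that say whether each sampled solution reached the correct final answer.
    A question counts as solved if any of its k samples is correct.
    """
    if not correct_flags_per_question:
        return 0.0
    solved = sum(1 for flags in correct_flags_per_question if any(flags))
    return solved / len(correct_flags_per_question)
```

Greedy-decoding accuracy corresponds to the special case of a single deterministic sample per question.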