Weak-to-Strong Reasoning
Yuqing Yang, Yan Ma, Pengfei Liu
TL;DR
The paper addresses the challenge of supervising LLMs whose capabilities exceed those of their supervisors. It proposes a two-stage weak-to-strong reasoning framework in which the strong model refines its own training data. Stage I selects positive samples and fine-tunes the strong model on a small, high-quality subset (including Weak-ICL variants and iterative approaches) to recover its reasoning capabilities. Stage II adds negative samples and applies contrastive preference optimization (DPO or ORPO) so that the model learns to avoid the weak model's errors. Across GSM8K, MATH, and OlympicArena, the approach yields substantial gains over naive fine-tuning on the full set of weak-model data, with improvements under both greedy decoding and pass@k. The results indicate that self-directed data curation combined with preference-guided learning can strengthen reasoning without external ground-truth annotations, offering a scalable path toward robust open-ended reasoning in future AI systems.
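For readers unfamiliar with the pass@k metric mentioned above, the following minimal sketch shows the widely used unbiased estimator, which computes the probability that at least one of k samples drawn from n generated solutions is correct. Whether the paper uses exactly this estimator is an assumption; the function name `pass_at_k` and its parameters are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Uses the standard combinatorial estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is a generic illustration, not the paper's own code.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the fraction of correct samples, c / n.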
Abstract
When large language models (LLMs) exceed human-level capabilities, it becomes increasingly challenging to provide full-scale and accurate supervision for these models. Weak-to-strong learning, which leverages a less capable model to unlock the latent abilities of a stronger model, proves valuable in this context. Yet the efficacy of this approach for complex reasoning tasks remains untested. Furthermore, reasoning under the weak-to-strong setting currently lacks efficient methods to keep the strong model from blindly imitating the weak supervisor, including its errors. In this paper, we introduce a progressive learning framework that enables the strong model to autonomously refine its training data, without requiring input from either a more advanced model or human-annotated data. This framework begins with supervised fine-tuning on a small, selectively chosen, high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Extensive experiments on the GSM8K and MATH datasets demonstrate that our method significantly enhances the reasoning capabilities of Llama2-70b using three separate weak models. This method is further validated in a forward-looking experimental setup, where Llama3-8b-instruct effectively supervises Llama3-70b on the highly challenging OlympicArena dataset. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning powers. All relevant code and resources are available at https://github.com/GAIR-NLP/weak-to-strong-reasoning.
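To make the phrase "preference optimization on contrastive samples" concrete, here is a minimal sketch of the standard Direct Preference Optimization (DPO) loss for a single chosen/rejected pair. It is a generic illustration under stated assumptions, not the authors' implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical, and the inputs are assumed to be summed sequence log-probabilities under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state one
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log 2 when the policy favors the chosen response more strongly than the reference model does, and rises above it when the policy favors the rejected response, which is the mechanism by which the strong model can be steered away from the weak model's errors.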
