LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

Jingyuan Wang, Yankai Chen, Zhonghang Li, Chao Huang

TL;DR

This work tackles the resource-intensive nature of supervised fine-tuning for improving LLM reasoning by proposing LightReasoner, which leverages behavioral divergence between a strong expert and a weaker amateur to locate high-impact reasoning moments. It introduces a two-stage process: sampling informative steps via divergence metrics and constructing contrastive supervision, followed by self-distillation training to amplify the expert's reasoning strengths without ground-truth labels. The approach yields robust improvements across seven mathematical benchmarks, with up to 28.1% accuracy gains and dramatic efficiency savings (e.g., ~90% time reduction, ~80% fewer sampled problems, ~99% fewer tuned tokens). The results emphasize that domain expertise, rather than mere model scale, drives the effectiveness of contrastive learning, offering a scalable, label-free path to enhance LLM reasoning in diverse settings.

Abstract

Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner

Paper Structure

This paper contains 53 sections, 34 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Efficiency and performance comparison between SFT and LightReasoner. LightReasoner achieves competitive or superior accuracy while substantially reducing resource consumption.
  • Figure 2: Most tokens show minimal KL divergence, with only a few exhibiting elevated values.
  • Figure 2: Efficiency comparison between SFT and LightReasoner across total time, sampled problems, tuned tokens, and average accuracy improvement over 7 benchmarks.
  • Figure 3: Predictable tokens yield near-zero KL divergence, while critical steps trigger notable spikes.
  • Figure 4: Overview of the LightReasoner framework. Sampling Stage: Expert and Amateur models generate distributions $\pi_E$ and $\pi_A$. Informative step selection retains steps with $D_{\text{KL}}(\pi_E \parallel \pi_A) > \beta$, and contrastive supervision constructs soft labels $v_C$ capturing the Expert's advantage through Expert-Amateur contrast. Fine-tuning Stage: The Expert model is enhanced by minimizing the KLD between its output and $v_C$.
  • ...and 4 more figures
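
The sampling stage described in the Figure 4 caption can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation. The selection rule $D_{\text{KL}}(\pi_E \parallel \pi_A) > \beta$ comes from the caption. The exact form of the soft label $v_C$ is not given in this listing, so the version below is an assumption: a softmax over $\log \pi_E - \log \pi_A$, restricted to tokens the Expert finds plausible, in the style of contrastive decoding. The function names, the plausibility factor `alpha`, the toy distributions, and the threshold `beta = 0.4` are all hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def select_informative_steps(expert_steps, amateur_steps, beta):
    """Keep the step indices where D_KL(pi_E || pi_A) exceeds beta."""
    return [
        t
        for t, (pi_e, pi_a) in enumerate(zip(expert_steps, amateur_steps))
        if kl_divergence(pi_e, pi_a) > beta
    ]


def contrastive_soft_label(pi_e, pi_a, alpha=0.1, eps=1e-12):
    """Assumed form of v_C: softmax of log pi_E - log pi_A.

    Tokens with Expert probability below alpha * max(pi_E) get zero mass.
    This masking rule is an illustrative assumption, not taken from the source.
    """
    cutoff = alpha * max(pi_e)
    scores = [
        math.log(pe + eps) - math.log(pa + eps) if pe >= cutoff else None
        for pe, pa in zip(pi_e, pi_a)
    ]
    top = max(s for s in scores if s is not None)
    # Subtract the maximum score before exponentiating for numerical stability.
    weights = [math.exp(s - top) if s is not None else 0.0 for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
expert_steps = [[0.90, 0.05, 0.05], [0.70, 0.20, 0.10]]
amateur_steps = [[0.88, 0.06, 0.06], [0.20, 0.50, 0.30]]
beta = 0.4  # hypothetical threshold

selected = select_informative_steps(expert_steps, amateur_steps, beta)
soft_labels = {
    t: contrastive_soft_label(expert_steps[t], amateur_steps[t]) for t in selected
}
```

In this toy input the two models nearly agree at the first step, so only the second step passes the threshold and receives a soft label. Under the assumed form of $v_C$, that label shifts mass toward the token the Expert prefers more strongly than the Amateur does. The fine-tuning stage would then minimize the KL divergence between the Expert's output distribution and $v_C$ at the selected steps only.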