Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection

Zhuoyang Wu, Xinze Li, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Minghe Yu, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun

TL;DR

The paper tackles the challenge of distilling long-form reasoning from powerful teachers to small language models for mathematical tasks, where naive CoT transfer struggles due to capacity gaps. It introduces ORION, an error-aware self-reflection framework that refines teacher CoTs by leveraging the student’s own errors through error exposure and self-reflection, generating targeted supervision for supervised fine-tuning. Empirical results across GSM-Hard, MATH500, AIME24, and AMC23 show ORION consistently improves accuracy, training stability, and CoT quality, with ablations confirming the complementary roles of error exposure and self-reflection. The approach generalizes across multiple backbones and reduces verbosity of reasoning while boosting correctness, offering a practical pathway to more reliable reasoning distillation, albeit at the cost of dependence on closed-source reasoning LLMs for data generation.

Abstract

Large Language Models (LLMs) have exhibited strong reasoning capabilities and achieved remarkable performance in mathematical problem-solving tasks. Recently, distilling reasoning ability from long-form Chains-of-Thought (CoTs) has emerged as a promising approach for enhancing Small Language Models (SLMs). Existing studies typically treat SLMs as student models and use long-form CoTs as supervision signals for Supervised Fine-Tuning (SFT) to transfer reasoning ability. However, such long-form CoT teachers are usually unaware of the student model's capacity, which limits the effective utilization of the provided reasoning traces. To overcome this limitation, we propose errOr-aware self-ReflectION (ORION), a framework that refines teacher CoTs through an Error-Aware Reflection process. ORION enables the student model to construct more tailored CoTs by refining the teacher CoTs and incorporating its own reasoning errors. Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines. Further analysis reveals that the CoTs constructed by ORION exhibit higher coherence and logical consistency, thereby serving as more effective supervision signals for SFT. All code is available at https://github.com/NEUIR/ORION.git.
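The data-construction idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Problem`, `SFTExample`, `build_sft_data`, `student_solve`, `student_reflect`, and `extract_answer` are all hypothetical, the model calls are passed in as plain callables, and the final filter on the gold answer is an assumption that the abstract does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Problem:
    question: str
    answer: str       # gold final answer
    teacher_cot: str  # long-form CoT produced by the teacher model


@dataclass
class SFTExample:
    prompt: str
    target: str


def build_sft_data(
    problems: List[Problem],
    student_solve: Callable[[str], str],
    extract_answer: Callable[[str], str],
    student_reflect: Callable[[str, str, Optional[str]], str],
) -> List[SFTExample]:
    """Build SFT pairs from teacher CoTs refined with the student's own errors.

    All callables are hypothetical stand-ins for model calls:
      student_solve(question)                 -> the student's own attempt
      extract_answer(text)                    -> final answer parsed from text
      student_reflect(question, cot, error)   -> refined CoT
    """
    examples: List[SFTExample] = []
    for p in problems:
        # Error exposure: let the student try the problem and keep the
        # attempt only when its final answer is wrong.
        attempt = student_solve(p.question)
        error = attempt if extract_answer(attempt) != p.answer else None

        # Self-reflection: the student rewrites the teacher CoT, seeing its
        # own erroneous attempt when there is one.
        refined = student_reflect(p.question, p.teacher_cot, error)

        # Assumption (not from the paper): discard refined CoTs that no
        # longer reach the gold answer.
        if extract_answer(refined) == p.answer:
            examples.append(SFTExample(prompt=p.question, target=refined))
    return examples
```

The resulting `SFTExample` pairs would then feed a standard SFT loop; the prompts used for error exposure and reflection are described in the paper's figures and are not reproduced here.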

Paper Structure

This paper contains 21 sections, 9 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The Framework of Our ORION Model. ORION refines long-form reasoning via self-reflection.
  • Figure 2: Illustration of Our ORION Model.
  • Figure 3: Performance of Distilled Models Optimized with Different Training Strategies. The panels report, in order: the entropy scores during distillation under various strategies; the response lengths generated by the distilled models; and the quality of CoT and final responses as judged by vanilla LLMs (panel label `ppl`) and by GPT-4 (panel label `gpt_score`).
  • Figure 4: Analysis of Distilled Models on Different Error Types. All experiments are based on the Qwen3-8B model. The first panel shows the distribution of distinct error types encountered by vanilla LLMs, while the remaining panels present the corresponding correction rates for reasoning, calculation, and understanding errors.
  • Figure 5: The Prompt Templates Used for Error Exposure.
  • ...and 4 more figures