Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection

Zhuoyang Wu; Xinze Li; Zhenghao Liu; Yukun Yan; Zhiyuan Liu; Minghe Yu; Cheng Yang; Yu Gu; Ge Yu; Maosong Sun

TL;DR

The paper tackles distilling long-form reasoning from powerful teacher models into small language models for mathematical tasks, where naive Chain-of-Thought (CoT) transfer struggles because of the capacity gap between teacher and student. It introduces ORION, an error-aware self-reflection framework that refines teacher CoTs by leveraging the student’s own errors through error exposure and self-reflection, generating targeted supervision for supervised fine-tuning. Results on GSM-Hard, MATH500, AIME24, and AMC23 show that ORION consistently improves accuracy, training stability, and CoT quality, with ablations confirming the complementary roles of error exposure and self-reflection. The approach generalizes across multiple backbones and reduces the verbosity of reasoning while boosting correctness, offering a practical pathway to more reliable reasoning distillation, albeit at the cost of depending on closed-source reasoning LLMs for data generation.

Abstract

Large Language Models (LLMs) have exhibited strong reasoning capabilities and achieved remarkable performance in mathematical problem-solving tasks. Recently, distilling reasoning ability from long-form Chains-of-Thought (CoTs) has emerged as a promising approach for enhancing Small Language Models (SLMs). Existing studies typically treat SLMs as student models and use long-form CoTs as supervision signals for Supervised Fine-Tuning (SFT) to transfer reasoning ability. However, such long-form CoT teachers are usually unaware of the student model's capacity, which limits the effective utilization of the provided reasoning traces. To overcome this limitation, we propose errOr-aware self-ReflectION (ORION), a framework that refines teacher CoTs through an Error-Aware Reflection process. ORION enables the student model to construct CoTs better tailored to its own capacity by refining the teacher CoTs and incorporating its own reasoning errors. Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines. Further analysis reveals that the CoTs constructed by ORION exhibit higher coherence and logical consistency, thereby serving as more effective supervision signals for SFT. All code is available at https://github.com/NEUIR/ORION.git.
