Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation

Yijia Luo, Yulin Song, Xingyao Zhang, Jiaheng Liu, Weixun Wang, GengRu Chen, Wenbo Su, Bo Zheng

TL;DR

This work examines why distilling long chain-of-thought (CoT) data transfers reasoning capabilities across LLMs and argues that the universality of distillation data is limited. It introduces the DLCoT framework, which deconstructs and optimizes long CoT data through segmentation, redundancy elimination, and error correction, with an emphasis on the reasoning trunk. Empirical results show at least a 5% gain in token efficiency and improved accuracy across benchmarks, with distinct behaviors across teacher models (R1 vs. QwQ) and student architectures (Qwen vs. Llama). A core finding is that preserving diverse trunk approaches while pruning redundant paths boosts reasoning transfer more than detailing every individual step. The approach offers a practical path toward training high-performing LLMs for complex reasoning tasks at lower computational cost.

Abstract

Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities through long chain-of-thought (CoT) reasoning. The R1 distillation scheme has emerged as a promising approach for training cost-effective models with enhanced reasoning abilities. However, the underlying mechanisms driving its effectiveness remain unclear. This study examines the universality of distillation data and identifies key components that enable the efficient transfer of long-chain reasoning capabilities in LLM distillation. Our findings reveal that the effectiveness of long CoT reasoning distillation from teacher models like Qwen-QwQ degrades significantly on nonhomologous models, challenging the assumed universality of current distillation methods. To gain deeper insights into the structure and patterns of long CoT reasoning, we propose DLCoT (Deconstructing Long Chain-of-Thought), a distillation data enhancement framework. DLCoT consists of three key steps: (1) data segmentation to decompose complex long CoT structures, (2) simplification by eliminating unsolvable and redundant solutions, and (3) optimization of intermediate error states. Our approach significantly improves model performance and token efficiency, facilitating the development of high-performance LLMs.
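As a rough illustration of step (2) only, the sketch below prunes a long CoT trace that has already been segmented. It is a toy under stated assumptions, not the authors' implementation: the `Segment` class, its `approach` and `solved` fields, the `prune` function, and the sample trace texts are all hypothetical, and the abstract does not specify how segments are labeled or how intermediate error states are optimized, so steps (1) and (3) are left out.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One attempt inside a long CoT trace (hypothetical structure)."""
    approach: str  # assumed strategy label, e.g. "combine like terms"
    text: str      # the reasoning text of this attempt
    solved: bool   # assumed flag: True if the attempt reaches an answer


def prune(segments: list[Segment]) -> list[Segment]:
    """Drop unsolved attempts and keep only the first attempt per approach."""
    seen: set[str] = set()
    kept: list[Segment] = []
    for seg in segments:
        # Skip attempts that never reach an answer, and repeats of an
        # approach that has already been kept.
        if not seg.solved or seg.approach in seen:
            continue
        seen.add(seg.approach)
        kept.append(seg)
    return kept


# Illustrative trace for "Simplify 2y + 3y + 4y"; the texts are made up.
trace = [
    Segment("combine like terms", "2y + 3y + 4y = (2 + 3 + 4)y = 9y", True),
    Segment("substitute a value", "Let y = 1 ... this does not generalize", False),
    Segment("combine like terms", "Again: 2 + 3 + 4 = 9, so 9y", True),
    Segment("factor out y", "y(2 + 3 + 4) = 9y", True),
]

pruned = prune(trace)
print([seg.approach for seg in pruned])
# → ['combine like terms', 'factor out y']
```

Keeping one solved attempt per distinct approach mirrors the TL;DR's point that diverse approaches are preserved while redundant paths are removed; a real pipeline would need a far more careful notion of "same approach" than string equality.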

Paper Structure

This paper contains 26 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparative analysis of model distillation performance using 6K correct-answer NuminaMath samples. Upper panel: Qwen and Llama models distilled with R1-generated data. Notably, while these models fail to fully replicate R1's reported performance, Qwen2.5-14B achieves accuracy comparable to the QwQ-distilled Qwen2.5-32B, demonstrating R1's stronger cross-model transferability in distillation scenarios. Lower panel: the Qwen-family models show better distillation efficacy than Llama when trained on QwQ-generated data. Qwen2.5-32B achieves performance parity with QwQ.
  • Figure 2: An example of QwQ and R1 outputs on "Simplify $2y + 3y + 4y$". Color-coded text blocks represent distinct solution/verification types. In the middle are three special structures that we identified in long CoT data.
  • Figure 3: The workflow for DLCoT. It involves five steps: (1) Macro-Structure Parsing, (2) Approach & Verification Parsing, (3) Redundancy Analysis, (4) Optimized Integration, and (5) Coherence Reconstruction.
  • Figure 4: Average Cluster Number vs. Average Try Number per Cluster.
  • Figure 5: Average token output of various distillation models on AIME2024, MATH500 and GSM8K.
  • ...and 1 more figures