Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao

TL;DR

A three-stage curriculum learning framework that addresses the teacher-student capacity mismatch through progressive skill acquisition. It first establishes structural understanding via masked shuffled reconstruction, then applies Group Relative Policy Optimization (GRPO) on masked completion tasks so that the model discovers its own balance between accuracy and brevity, and finally uses targeted rewriting of persistent failure cases, again optimized with GRPO, to internalize teacher knowledge.

Abstract

Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to reproduce faithfully. Existing approaches compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy improvement while reducing output length by 27.4 percent, surpassing both instruction-tuned variants and prior distillation methods.

Paper Structure (45 sections, 10 equations, 12 figures, 7 tables, 1 algorithm)

Figures (12)

  • Figure 1: Overview of BRIDGE. Stage 1 establishes structural understanding through masked shuffled reconstruction. Stage 2 applies GRPO on masked completion tasks to balance accuracy and compression. Stage 3 identifies failure cases, applies teacher-guided rewriting for internalization, and uses GRPO to maintain compression capabilities.
  • Figure 2: Illustration of the Structure-Aware Warmup data construction. We randomly mask one step (with $p=0.7$) and shuffle the sequence to force the student to learn logical dependencies.
  • Figure 3: Illustrative prompt template for the internalization step. The student sees the teacher's complete solution but must express the reasoning in its own concise style. See Appendix \ref{app:prompts} for the exact prompt used in our implementation.
  • Figure 4: Output token distribution on GSM8K (Qwen2.5-3B): Base model vs. BRIDGE.
  • Figure 5: Std-CoT KD falls into repetition loops when overwhelmed, whereas BRIDGE generates concise, correct reasoning.
  • ...and 7 more figures
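
The data construction described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that $p=0.7$ is the probability that a single, uniformly chosen step is masked in a given example, and that the student's target is the original ordered sequence of steps. The names `build_warmup_example`, `MASK_TOKEN`, and the returned dictionary keys are hypothetical and chosen for illustration only.

```python
import random

# Hypothetical placeholder string; the paper's actual mask token is not given here.
MASK_TOKEN = "<MASK>"


def build_warmup_example(steps, mask_prob=0.7, rng=None):
    """Build one masked, shuffled reconstruction example from ordered reasoning steps.

    Assumption: with probability `mask_prob`, exactly one randomly chosen step
    is replaced by MASK_TOKEN; the (possibly masked) steps are then shuffled.
    The target is the original, ordered, unmasked sequence.
    """
    rng = rng or random.Random()
    corrupted = list(steps)
    masked_index = None

    # Decide whether to mask, then pick which single step to hide.
    if corrupted and rng.random() < mask_prob:
        masked_index = rng.randrange(len(corrupted))
        corrupted[masked_index] = MASK_TOKEN

    # Shuffle so the student must recover the logical order itself.
    rng.shuffle(corrupted)

    return {
        "input": corrupted,            # shuffled steps, at most one masked
        "target": list(steps),         # original order, nothing masked
        "masked_index": masked_index,  # position in the original order, or None
    }


if __name__ == "__main__":
    steps = [
        "Tom has 3 boxes with 4 apples each, so 3 * 4 = 12 apples.",
        "He gives away 5 apples, so 12 - 5 = 7 apples remain.",
        "The answer is 7.",
    ]
    example = build_warmup_example(steps, rng=random.Random(0))
    print(example["input"])
    print(example["target"])
```

Under these assumptions, reconstructing `target` from `input` requires both restoring the order of the steps and filling in the hidden one, which is the dependency-learning pressure the caption describes.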