Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation

Bowei He; Yankai Chen; Xiaokun Zhang; Linghe Kong; Philip S. Yu; Xue Liu; Chen Ma

TL;DR

The paper tackles the challenge of distilling knowledge from large language models to smaller ones by introducing IOA, a pedagogy-inspired three-stage data-synthesis framework. IOA identifies knowledge gaps (Identifier), organizes gradual learning through topology-guided curricula with mastery gating and Zone of Proximal Development constraints (Organizer), and adapts data representations to the learner's cognitive capacity (Adapter). The approach is formalized with explicit metrics, dependency graphs, and stage-wise progression; experiments show IOA yields consistent gains across instruction-following and reasoning benchmarks and demonstrates robustness and efficiency relative to baselines. The work demonstrates that pedagogy-inspired curriculum design can substantially improve both effectiveness and efficiency in LLM knowledge distillation, with practical implications for accessible, resource-efficient AI development.

Abstract

Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws on fundamental educational principles. Our approach introduces a three-stage pipeline -- Knowledge Identifier, Organizer, and Adapter (IOA) -- that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to create a dynamic distillation process in which student models approach the teacher model's performance on prerequisite knowledge before advancing, and new knowledge is introduced in controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7% of teacher performance on DollyEval while using less than 1/10th of the parameters. Our framework particularly excels in complex reasoning tasks, showing a 19.2% improvement on MATH and a 22.3% improvement on HumanEval compared with state-of-the-art baselines.
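The mastery-gating and Zone of Proximal Development ideas described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `plan_next_modules`, the per-module scores, the scalar difficulty values, and the exact way the thresholds `tau_mastery` and `tau_zpd` are applied are all assumptions made for illustration. The sketch only shows the general shape of the idea: a knowledge module becomes eligible once its prerequisites are mastered relative to the teacher and its difficulty is a small step above what the student has already mastered.

```python
from graphlib import TopologicalSorter


def plan_next_modules(prereqs, difficulty, student_score, teacher_score,
                      tau_mastery=0.9, tau_zpd=0.25):
    """Return the knowledge modules a student may study next.

    All names and semantics here are illustrative assumptions:
      prereqs       -- dict: module -> set of prerequisite modules
      difficulty    -- dict: module -> difficulty in [0, 1]
      student_score -- dict: module -> student accuracy
      teacher_score -- dict: module -> teacher accuracy
      tau_mastery   -- fraction of teacher performance that counts as mastery
      tau_zpd       -- largest allowed difficulty jump above mastered material
    """
    # Visit modules so that every prerequisite comes before its dependents.
    order = list(TopologicalSorter(prereqs).static_order())

    # Mastery gate: the student must approach the teacher on a module.
    mastered = {m for m in order
                if student_score[m] >= tau_mastery * teacher_score[m]}

    # Hardest difficulty level the student has mastered so far.
    current_level = max((difficulty[m] for m in mastered), default=0.0)

    ready = []
    for m in order:
        if m in mastered:
            continue
        # Every prerequisite must be mastered before advancing.
        if not all(p in mastered for p in prereqs.get(m, ())):
            continue
        # ZPD constraint: only a controlled step up in difficulty.
        if difficulty[m] - current_level <= tau_zpd:
            ready.append(m)
    return ready


if __name__ == "__main__":
    prereqs = {"arithmetic": set(),
               "algebra": {"arithmetic"},
               "calculus": {"algebra"}}
    difficulty = {"arithmetic": 0.1, "algebra": 0.3, "calculus": 0.8}
    teacher = {"arithmetic": 1.0, "algebra": 1.0, "calculus": 1.0}
    student = {"arithmetic": 0.95, "algebra": 0.40, "calculus": 0.10}
    print(plan_next_modules(prereqs, difficulty, student, teacher))
```

In this toy setup only "algebra" is eligible: "calculus" is blocked because its prerequisite is not yet mastered, and even after "algebra" is mastered it would remain blocked until the difficulty gap falls within `tau_zpd`.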

Paper Structure

This paper contains 49 sections, 10 equations, 6 figures, 10 tables, 1 algorithm.

Introduction
Related Works
Methodology
Problem Formulation
Identifier: Knowledge Deficiency Diagnosis and Targeting
Organizer: Progressive Curriculum Design with Mastery Learning
Adapter: Knowledge Representation Adaptation for Cognitive Alignment
Overall Knowledge Distillation Framework with Data Synthesis
Experiments
Experiment Setup
Main Results and Analysis
Supplementary Results and Analysis
Conclusions and Future Works
Pedagogical Foundations and How They Shape IOA
Theoretical Background
...and 34 more sections

Figures (6)

Figure 1: Analogy between real education and language model knowledge distillation.
Figure 2: Pedagogically-inspired data synthesis framework for language model knowledge distillation.
Figure 3: The hyperparameter robustness analysis for three critical hyperparameters $J_i$, $\tau_{\text{ZPD}}$, $\tau_{\text{mastery}}$.
Figure 4: Time consumption comparison between our IOA and baselines.
Figure 5: The hyperparameter robustness analysis for other five critical hyperparameters: $\tau_{\text{gap}}$, $\tau_{\text{high}}$, $\tau_{\text{low}}$, $\tau_{\text{dep}}$, $\alpha$ in the Identifier component of our IOA framework.
...and 1 more figure
