Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation
Bowei He, Yankai Chen, Xiaokun Zhang, Linghe Kong, Philip S. Yu, Xue Liu, Chen Ma
TL;DR
The paper tackles the challenge of distilling knowledge from large language models to smaller ones by introducing IOA, a pedagogy-inspired three-stage data-synthesis framework. IOA identifies knowledge gaps (Identifier), organizes gradual learning through topology-guided curricula with mastery gating and Zone of Proximal Development constraints (Organizer), and adapts data representations to the learner's cognitive capacity (Adapter). The approach is formalized with explicit metrics, dependency graphs, and stage-wise progression; experiments show IOA yields consistent gains across instruction-following and reasoning benchmarks and demonstrates robustness and efficiency relative to baselines. The work demonstrates that pedagogy-inspired curriculum design can substantially improve both effectiveness and efficiency in LLM knowledge distillation, with practical implications for accessible, resource-efficient AI development.
Abstract
Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws on fundamental educational principles. Our approach introduces a three-stage pipeline, Knowledge Identifier, Organizer, and Adapter (IOA), that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to create a dynamic distillation process in which student models approach the teacher model's performance on prerequisite knowledge before advancing, and new knowledge is introduced with controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7% of teacher performance on DollyEval while using less than one-tenth of the parameters. Our framework particularly excels in complex reasoning tasks, showing a 19.2% improvement on MATH and a 22.3% improvement on HumanEval compared with state-of-the-art baselines.
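The mastery gating and Zone of Proximal Development constraint described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `eligible_modules`, the `mastery_ratio` and `zpd_step` thresholds, the knowledge-module names, and all scores are hypothetical values chosen for illustration. The sketch assumes a module counts as mastered once the student's score reaches a fixed fraction of the teacher's score, and that a new module may be introduced only if all of its prerequisites are mastered and its difficulty exceeds the hardest mastered module by at most a fixed step.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def eligible_modules(prereqs, student, teacher, difficulty,
                     mastery_ratio=0.9, zpd_step=0.25):
    """Return the knowledge modules the student may study next.

    prereqs maps each module to the list of modules it depends on.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    # Mastery gate (assumed form): student reaches a fraction of teacher score.
    mastered = {m for m in prereqs
                if student[m] >= mastery_ratio * teacher[m]}
    # Current level: difficulty of the hardest module already mastered.
    current_level = max((difficulty[m] for m in mastered), default=0.0)

    candidates = []
    # Walk modules in dependency order (prerequisites come first).
    for m in TopologicalSorter(prereqs).static_order():
        if m in mastered:
            continue
        if not all(p in mastered for p in prereqs[m]):
            continue  # blocked: some prerequisite is not yet mastered
        if difficulty[m] - current_level > zpd_step:
            continue  # too large a difficulty jump for the current level
        candidates.append(m)
    return candidates


# Hypothetical dependency graph, scores, and difficulty estimates.
prereqs = {
    "arithmetic": [],
    "algebra": ["arithmetic"],
    "word_problems": ["arithmetic"],
    "calculus": ["algebra"],
}
teacher = {"arithmetic": 0.95, "algebra": 0.90,
           "word_problems": 0.90, "calculus": 0.85}
student = {"arithmetic": 0.90, "algebra": 0.50,
           "word_problems": 0.40, "calculus": 0.10}
difficulty = {"arithmetic": 0.2, "algebra": 0.4,
              "word_problems": 0.5, "calculus": 0.8}

print(eligible_modules(prereqs, student, teacher, difficulty))  # ['algebra']
```

In this toy setting only `algebra` is offered: `calculus` is blocked by an unmastered prerequisite, and `word_problems` is deferred because its difficulty lies more than `zpd_step` above the student's current level. Once the student's `algebra` score passes the gate, `word_problems` falls within range while `calculus` remains out of reach.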
