SODA: Semi On-Policy Black-Box Distillation for Large Language Models

Xiwen Chen, Jingjing Wang, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hejian Sang, Zhipeng Wang, Alborz Geramifard, Feng Luo

Abstract

Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student's inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model's natural, zero-shot responses are almost strictly inferior to the powerful teacher's targets, we can construct a highly effective contrastive signal simply by pairing the teacher's optimal response with a one-time static snapshot of the student's outputs. We show that exposing the small student to its own static, inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm. SODA matches or outperforms state-of-the-art methods on 15 of 16 benchmark results. More importantly, it achieves this superior distillation quality while training 10 times faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.
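
To make the pairing scheme concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of how SODA's contrastive pairs could be assembled: the teacher's response serves as the preferred target, and a one-time frozen snapshot of the student's zero-shot output as the dispreferred one. `teacher_generate` and `student_generate` are assumed stand-ins for the black-box teacher API and the initial student model.

```python
# Hypothetical sketch of SODA's semi on-policy pair construction (names are
# illustrative, not the authors' implementation). The key property from the
# abstract: the student snapshot is generated ONCE, before training, so no
# per-step rollouts or adversarial discriminator are needed.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # teacher's response, assumed (almost strictly) superior
    rejected: str  # frozen initial student's zero-shot response

def build_static_pairs(
    prompts: List[str],
    teacher_generate: Callable[[str], str],  # black-box teacher API (assumed)
    student_generate: Callable[[str], str],  # initial student q_0, frozen
) -> List[PreferencePair]:
    """One-time static snapshot: materialize every pair before training."""
    return [
        PreferencePair(p, chosen=teacher_generate(p), rejected=student_generate(p))
        for p in prompts
    ]
```

Because the snapshot is produced once, up front, training then proceeds over a fixed dataset; this is what removes the per-step student rollouts and the discriminator that fully on-policy adversarial methods require.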

Paper Structure

This paper contains 18 sections, 8 equations, 3 figures, 4 tables, and 1 algorithm.

Figures (3)

  • Figure 1: SODA achieves competitive or better distillation quality than GAD while training 10$\times$ faster and using 27% less peak GPU memory, and it is substantially easier and more stable to train. From left to right: GPT-4o Score averaged over four student models (higher is better; 50 denotes GPT-4o parity); training stability; wall-clock training time; and peak GPU memory.
  • Figure 2: Dual learning signal in SODA. (a) A brief warmup shifts the student toward the teacher via imitation, but residual modes from $q_0$ persist. (b) Preference-based distillation additionally suppresses these student-specific inferior behaviors via mode pruning, yielding $q_{\text{SODA}} \approx p$. (A hypothetical code sketch of this dual signal follows the figure list.)
  • Figure 3: Representation analysis on Llama-3.1-8B-Instruct (200 held-out LMSYS prompts). (a) Layer-wise CKA similarity to the base model: SODA diverges most, indicating deeper representational restructuring. (b, c) Last-layer activation entropy and kurtosis: SODA achieves the highest entropy and lowest kurtosis, correlating with its strongest distillation performance.
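
Figure 2's dual learning signal (a brief imitation warmup, then preference-based mode pruning) suggests a simple two-phase training loop. The sketch below is an illustration under stated assumptions: an HF-style causal LM interface for the warmup, and a DPO-style objective for the preference phase, since this page does not give the paper's exact loss.

```python
# Hypothetical two-phase sketch of SODA's dual learning signal (Figure 2).
# Assumptions, not confirmed by this page: an HF-style causal LM whose
# forward pass returns .loss given labels, and a DPO-style preference loss.
import torch.nn.functional as F

def warmup_loss(student, input_ids, teacher_labels):
    """Phase (a): brief imitation warmup -- next-token cross-entropy on
    teacher responses (sequence-level supervised fine-tuning)."""
    return student(input_ids, labels=teacher_labels).loss

def preference_loss(pi_logp_chosen, pi_logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Phase (b): contrast teacher responses (chosen) against the static
    student snapshot (rejected). Inputs are summed per-sequence log-probs
    under the trainable policy and a frozen reference copy of the
    warmed-up student; minimizing this prunes the inferior modes of q_0."""
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```

In this reading, phase (a) moves the student toward the teacher distribution $p$, and phase (b) suppresses the residual modes of $q_0$ that imitation alone leaves behind, yielding $q_{\text{SODA}} \approx p$ as depicted in Figure 2(b).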