Precise Knowledge Transfer via Flow Matching

Shitong Shao; Zhiqiang Shen; Linrui Gong; Huanran Chen; Xu Dai

TL;DR

FM-KT addresses the challenge of precise teacher-to-student knowledge transfer by leveraging continuous normalizing flows to progressively transform student representations toward the teacher. It introduces a serial training paradigm to prevent information leakage and establishes a theoretical link to the upper bound on the teacher’s negative log-likelihood $-\log p_{v_\theta}(Z_0)$, providing a rigorous optimization target. The framework is highly flexible, supporting feature- and logit-based distillation with arbitrary meta-encoders and noise schedules, and it includes lightweight (FM-KT$^{\Theta}$) and online (OFM-KT) variants. Empirically, FM-KT and OFM-KT deliver state-of-the-art results on CIFAR-100, ImageNet-1k, and MS-COCO, illustrating strong scalability and practical impact for deploying accurate, compressed models in resource-constrained environments.

Abstract

In this paper, we propose a novel knowledge transfer framework that introduces continuous normalizing flows for progressive knowledge transformation and leverages multi-step sampling strategies to achieve precise knowledge transfer. We name this framework Knowledge Transfer with Flow Matching (FM-KT). It can be integrated with a metric-based distillation method of any form (e.g., vanilla KD, DKD, PKD and DIST) and a meta-encoder of any available architecture (e.g., CNN, MLP and Transformer). By introducing stochastic interpolants, FM-KT is readily amenable to arbitrary noise schedules (e.g., VP-ODE, VE-ODE, Rectified flow) for normalized flow path estimation. We theoretically demonstrate that the training objective of FM-KT is equivalent to minimizing the upper bound of the negative log-likelihood of the teacher feature map or logit. Besides, FM-KT can be viewed as a unique implicit ensemble method that leads to performance gains. With a slight modification, FM-KT can also be transformed into an online distillation framework, OFM-KT, with desirable performance gains. Through extensive experiments on the CIFAR-100, ImageNet-1k, and MS-COCO datasets, we empirically validate the scalability and state-of-the-art performance of our proposed methods among relevant comparison approaches.
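The multi-step sampling idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a rectified-flow style straight path, with the student representation at t = 1 and the teacher representation at t = 0 (an orientation inferred from $Z_0$ appearing in the teacher likelihood in the TL;DR). The names `euler_transfer` and `make_oracle_velocity` are hypothetical. The "oracle" velocity reads the teacher feature directly, which a trained meta-encoder would not be allowed to do; it only stands in for a learned velocity field so that the integration loop can be checked.

```python
from typing import Callable, List

Vector = List[float]


def euler_transfer(
    student_feat: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_steps: int,
) -> List[Vector]:
    """Integrate dZ/dt = velocity(Z, t) from t=1 down to t=0 with Euler steps.

    Returns the whole trajectory, starting with the student feature.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    dt = 1.0 / num_steps
    z = list(student_feat)
    trajectory = [list(z)]
    for i in range(num_steps):
        t = 1.0 - i * dt  # current time, moving from 1 toward 0
        v = velocity(z, t)
        # Step backward in time: Z_{t - dt} = Z_t - dt * v(Z_t, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
        trajectory.append(list(z))
    return trajectory


def make_oracle_velocity(student_feat: Vector, teacher_feat: Vector):
    """Hypothetical stand-in for a trained meta-encoder (assumption).

    On the straight path Z_t = (1 - t) * teacher + t * student, the velocity
    dZ/dt is the constant vector (student - teacher).
    """
    direction = [s - t for s, t in zip(student_feat, teacher_feat)]

    def velocity(z: Vector, t: float) -> Vector:
        return direction

    return velocity


student = [0.0, 2.0, -1.0]
teacher = [1.0, 0.0, 3.0]
trajectory = euler_transfer(
    student, make_oracle_velocity(student, teacher), num_steps=4
)
final = trajectory[-1]
```

With an exact velocity field the last point of the trajectory coincides with the teacher feature. With a learned, imperfect field, each intermediate point is a separate estimate of the target, which is one way to read the abstract's description of the method as an implicit ensemble.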

Paper Structure (43 sections, 2 theorems, 18 equations, 11 figures, 10 tables, 1 algorithm)

Introduction
Preliminaries
Review the Knowledge Transfer.
Continuous Normalized Flows.
Noise Schedules.
Methodology
Serial Training Paradigm
Choice of Noise Schedule
Serve to Feature-/Logit-based Distillation
Approximate to Ensemble
Lightweight FM-KT$^{\Theta}$ without Additional Inference Burden
Translate to Online Knowledge Distillation
Experiments
Image Classification Comparison
Offline Knowledge Distillation.
...and 28 more sections

Key Result

Theorem 3.1

(Proof in the Appendix.) Optimizing $\mathcal{L}_\textrm{FM-KT}$ not only avoids "cheating" by accessing $X^T$ during training, but also establishes an equivalence to the upper bound of the negative log-likelihood of $X^T$.

Figures (11)

Figure 1: A highly scalable knowledge transfer framework FM-KT.
Figure 2: The overall structure of FM-KT.
Figure 3: Trajectories of Top-1 test accuracy with the WRN-40-2/WRN-16-2 pair on CIFAR-100 for various noise schedules: VP ODE, VE ODE, and Rectified flow. Please refer to the Appendix for more details.
Figure 4: An example of FM-KT usage.
Figure 5: Results of experiments on the ensemble capabilities of FM-KT on CIFAR-100. The numbers on the bars represent their performance gains compared to Student+Meta-encoder.
...and 6 more figures

Theorems & Definitions (2)

Theorem 3.1
Proposition 3.2
