Direct Distillation between Different Domains
Jialiang Tang, Shuo Chen, Gang Niu, Hongyuan Zhu, Joey Tianyi Zhou, Chen Gong, Masashi Sugiyama
TL;DR
The paper tackles knowledge distillation when the student must operate in a target domain different from the teacher's source domain, a setting where the traditional two-stage pipeline of domain adaptation followed by KD is costly and prone to error accumulation. It introduces 4Ds, a one-stage framework that needs no access to source data. A Fourier-transform-based adapter decouples domain-invariant semantics from domain-specific styles in the teacher, and a fusion-activation module transfers the invariant knowledge to a smaller student while the adapter learns target-domain specifics. Because the adapter adds only about 2% of the teacher's parameters, adapting the teacher stays lightweight and the resulting teacher is student-friendly. The approach outperforms state-of-the-art KD and DA baselines across multiple benchmarks. Overall, 4Ds offers a practical, efficient solution for cross-domain model compression with strong empirical performance and data privacy advantages.
Abstract
Knowledge Distillation (KD) aims to learn a compact student network using knowledge from a large pre-trained teacher network, where both networks are trained on data from the same distribution. However, in practical applications, the student network may be required to perform in a new scenario (i.e., the target domain), which usually differs significantly from the known scenario of the teacher network (i.e., the source domain). Traditional domain adaptation techniques can be integrated with KD in a two-stage process to bridge the domain gap, but the reliability of such two-stage approaches tends to be limited by their high computational cost and by the errors accumulated across both stages. To solve this problem, we propose a new one-stage method dubbed "Direct Distillation between Different Domains" (4Ds). We first design a learnable adapter based on the Fourier transform to separate the domain-invariant knowledge from the domain-specific knowledge. Then, we build a fusion-activation mechanism to transfer the valuable domain-invariant knowledge to the student network, while simultaneously encouraging the adapter within the teacher network to learn the domain-specific knowledge of the target data. As a result, the teacher network can effectively transfer categorical knowledge that aligns with the target domain of the student network. Extensive experiments on various benchmark datasets demonstrate that our proposed 4Ds method successfully produces reliable student networks and outperforms state-of-the-art approaches.
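To make the idea of a Fourier-based separation concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not say which Fourier components carry which kind of knowledge, so the sketch assumes a convention common in Fourier-based domain adaptation work: the amplitude spectrum is treated as domain-specific "style" and the phase spectrum as domain-invariant "semantics". The function names (`split_amplitude_phase`, `recombine`, `adapt_features`) and the `amp_scale` parameter are hypothetical and stand in for whatever learnable adapter weights the method actually uses.

```python
import numpy as np


def split_amplitude_phase(feat):
    """Split a feature map into amplitude and phase via a 2D FFT.

    feat: real array of shape (channels, height, width).
    Assumption: amplitude ~ domain-specific style, phase ~ domain-invariant content.
    """
    spec = np.fft.fft2(feat, axes=(-2, -1))
    return np.abs(spec), np.angle(spec)


def recombine(amplitude, phase):
    """Rebuild a real feature map from an amplitude and a phase spectrum."""
    spec = amplitude * np.exp(1j * phase)
    # The imaginary part is numerical noise when the spectrum is conjugate-symmetric.
    return np.fft.ifft2(spec, axes=(-2, -1)).real


def adapt_features(feat, amp_scale):
    """Hypothetical adapter: rescale only the amplitude, leave the phase untouched.

    amp_scale plays the role of learnable adapter weights (here fixed numbers).
    """
    amplitude, phase = split_amplitude_phase(feat)
    return recombine(amplitude * amp_scale, phase)


rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))  # toy (channels, height, width) feature map

# An all-ones scale leaves the features unchanged (FFT round trip).
restored = adapt_features(feat, np.ones((4, 1, 1)))

# A per-channel scale changes the "style" strength of each channel only.
channel_scale = np.array([1.0, 2.0, 0.5, 3.0]).reshape(4, 1, 1)
adapted = adapt_features(feat, channel_scale)
```

In this toy setting the per-channel scale simply rescales each channel, which makes the behaviour easy to check. A learned adapter would more plausibly use frequency-dependent weights trained on target-domain data, which is the part this sketch leaves out.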
