Direct Distillation between Different Domains

Jialiang Tang, Shuo Chen, Gang Niu, Hongyuan Zhu, Joey Tianyi Zhou, Chen Gong, Masashi Sugiyama

TL;DR

The paper tackles knowledge distillation when the student must operate in a target domain that differs from the teacher's source domain, a setting where the traditional two-stage pipeline (domain adaptation plus KD) is costly and prone to error accumulation. It introduces 4Ds, a one-stage framework that does not need access to the source data. A Fourier-transform-based adapter decouples domain-invariant semantics from domain-specific styles in the teacher, and a fusion-activation module transfers the invariant knowledge to a smaller student while the adapter learns target-domain specifics. Because the adapter adds few parameters (about 2% of the teacher), the teacher stays lightweight and student-friendly. The method is reported to outperform state-of-the-art KD and DA baselines across multiple benchmarks. Overall, 4Ds offers a practical, efficient solution for cross-domain model compression with strong empirical performance and data privacy advantages.

Abstract

Knowledge Distillation (KD) aims to learn a compact student network using knowledge from a large pre-trained teacher network, where both networks are trained on data from the same distribution. However, in practical applications, the student network may be required to perform in a new scenario (i.e., the target domain), which usually exhibits significant differences from the known scenario of the teacher network (i.e., the source domain). Traditional domain adaptation techniques can be integrated with KD in a two-stage process to bridge the domain gap, but the ultimate reliability of two-stage approaches tends to be limited due to the high computational cost and the additional errors accumulated from both stages. To solve this problem, we propose a new one-stage method dubbed "Direct Distillation between Different Domains" (4Ds). We first design a learnable adapter based on the Fourier transform to separate the domain-invariant knowledge from the domain-specific knowledge. Then, we build a fusion-activation mechanism to transfer the valuable domain-invariant knowledge to the student network, while simultaneously encouraging the adapter within the teacher network to learn the domain-specific knowledge of the target data. As a result, the teacher network can effectively transfer categorical knowledge that aligns with the target domain of the student network. Extensive experiments on various benchmark datasets demonstrate that our proposed 4Ds method successfully produces reliable student networks and outperforms state-of-the-art approaches.

Paper Structure (13 sections, 14 equations, 2 figures, 3 tables, 1 algorithm)


Figures (2)

  • Figure 1: Comparison between existing Knowledge Distillation (KD) methods and our Direct Distillation between Different Domains (4Ds). (a) Vanilla KD transfers knowledge from a fixed pre-trained teacher network to the student network where the data is identically distributed. (b) Our 4Ds trains the student network on the target data using the teacher network (with the learnable adapters) trained on the source data. "Adaptation after distillation" in (c) first learns a student network on the source data via KD and then generalizes the student network on the target data via Domain Adaptation (DA). "Distillation after adaptation" in (d) first adapts the teacher network trained on the source data to the target data via DA and then utilizes the adapted teacher network to guide the student network training on the target data via KD.
  • Figure 2: The diagram of our proposed 4Ds. (a) The teacher network $\mathcal{N}_{T}$ (ResNet34) and student network $\mathcal{N}_{S}$ (ResNet18) each consist of four blocks, where each block further contains several ResBlocks. During training, both $\mathcal{N}_{T}$ and $\mathcal{N}_{S}$ interactively learn from the target data, where $\mathcal{N}_{T}$ is encouraged to produce accurate and useful category relations for $\mathcal{N}_{S}$ by updating its imposed adapters. Meanwhile, $\mathcal{N}_{S}$ is promoted to learn the valuable domain-invariant features as well as the reliable category relations from $\mathcal{N}_{T}$. (b) In our designed adapter, the input feature $\mathbf{f}^{T}$ is first fed into two learnable convolution layers to grasp the target-domain-specific knowledge. Subsequently, the original domain-specific knowledge is refurbished by mixing the amplitudes $\boldsymbol{\alpha}^{T}$ and $\boldsymbol{\alpha}^{T}_{\text{ad}}$, which are decoupled from the original $\mathbf{f}^{T}$ and the adapted $\mathbf{f}^{T}_{\text{ad}}$, respectively. Finally, the output feature $\mathbf{f}^{T}_{\text{ift}}$ is recovered from the retained phase $\boldsymbol{\rho}^{T}$ of $\mathbf{f}^{T}$ and the refurbished amplitude $\boldsymbol{\alpha}_{\text{ref}}^{T}$. (c) The input source images and target images are decoupled into phases and amplitudes by the Fourier transform and decoupling operations.
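
The amplitude/phase decoupling described in Figure 2(b) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the linear mixing rule, and the `mix` weight are assumptions made for illustration, and a random array stands in for the output of the two learnable convolution layers, since this section does not give the exact mixing formula.

```python
import numpy as np

def decouple(feature):
    """Split a (C, H, W) feature map into Fourier amplitude and phase."""
    spectrum = np.fft.fft2(feature, axes=(-2, -1))
    return np.abs(spectrum), np.angle(spectrum)

def recombine(amplitude, phase):
    """Rebuild a spatial feature map from an amplitude and a phase."""
    spectrum = amplitude * np.exp(1j * phase)
    # Amplitudes of real inputs are symmetric, so the imaginary part
    # of the inverse transform is only rounding noise and is dropped.
    return np.fft.ifft2(spectrum, axes=(-2, -1)).real

def refurbish(f_t, f_ad, mix=0.5):
    """Keep the phase of f_t and blend the two amplitudes.

    `mix` is a hypothetical weight: 0 keeps the original amplitude,
    1 uses only the amplitude of the adapted feature.
    """
    amp_t, phase_t = decouple(f_t)
    amp_ad, _ = decouple(f_ad)  # phase of the adapted feature is discarded
    amp_ref = (1.0 - mix) * amp_t + mix * amp_ad
    return recombine(amp_ref, phase_t)

rng = np.random.default_rng(0)
f_t = rng.standard_normal((4, 8, 8))   # stand-in for the teacher feature
f_ad = rng.standard_normal((4, 8, 8))  # stand-in for the adapted feature
f_ift = refurbish(f_t, f_ad, mix=0.3)
print(f_ift.shape)
```

Under this sketch, setting `mix=0.0` returns the original feature up to floating-point error, which shows that the phase is carried through unchanged and only the amplitude is modified.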