Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher

Yong Guo, Shulian Zhang, Haolin Pan, Jing Liu, Yulun Zhang, Jian Chen

TL;DR

Gap Preserving Distillation (GPD) tackles the problem of diminishing knowledge transfer when a fixed, powerful teacher becomes too far ahead of a compact student. It introduces a trainable dynamic teacher (DT) constructed from the student via Inverse Reparameterization (IR) and strengthens transfer through Channel-Branch Reparameterization (CBR) and a hard parameter-sharing strategy. The method optimizes a gap-preserving objective in which the dynamic teacher guides the student while being guided by the static teacher, enabling bidirectional knowledge flow and direct parameter inheritance. Experiments on ImageNet demonstrate consistent improvements across CNNs and ViTs in both training-from-scratch and fine-tuning settings, with notable gains in scenarios lacking a pre-trained teacher and modest overhead relative to traditional KD.

Abstract

Knowledge distillation aims to transfer knowledge from a large teacher model to a compact student counterpart, and the two are often separated by a significant performance gap. We find that an overly large performance gap can hamper the training process, which recent studies also confirm. To address this, we propose a Gap Preserving Distillation (GPD) method that trains an additional dynamic teacher model from scratch alongside the student to bridge this gap. In this way, it becomes possible to maintain a reasonable performance gap between teacher and student throughout the whole distillation process. To further strengthen distillation from the dynamic teacher to the student, we develop a hard strategy that forces them to share parameters and encourages parameter inheritance. Besides this hard strategy, we also build soft bidirectional mappings between them, based on an Inverse Reparameterization (IR) method and a Channel-Branch Reparameterization (CBR) strategy. We highlight that our IR is able to initialize a larger dynamic teacher with an arbitrary expansion ratio while preserving exactly the same accuracy as the given student model. This guarantees that the dynamic teacher and student start from the same point and avoids an overly large gap in the early stage of training. As for our CBR, with parameter sharing, it directly extracts an effective student model from the well-learned dynamic teacher without any post-training, making our method highly flexible for model deployment. In the experiments, GPD significantly outperforms existing distillation methods on both CNN and transformer architectures, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to scenarios without a pre-trained teacher, including training from scratch and fine-tuning, yielding improvements of 1.80% and 0.89% on ResNet18, respectively.
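
The direction of knowledge flow described above (the static teacher guides both the dynamic teacher and the student, and the dynamic teacher additionally guides the student) can be sketched as two loss terms. This is a minimal NumPy illustration rather than the paper's exact objective: the function names, the KL-based distillation term, and the weights `alpha` and `beta` are assumptions introduced here.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy against integer class labels."""
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())

def kd_loss(student_logits, teacher_logits, t=1.0):
    """Mean KL(teacher || student) on temperature-softened distributions."""
    p_t = softmax(teacher_logits, t)
    p_s = softmax(student_logits, t)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(kl.mean() * t * t)

def gpd_style_losses(s_logits, d_logits, t_logits, labels, alpha=1.0, beta=1.0):
    """Hypothetical loss layout; alpha and beta are illustrative weights.

    s_logits: student, d_logits: dynamic teacher, t_logits: static teacher.
    In an autodiff framework the guiding logits would be treated as constants.
    """
    # Student: labels + static teacher + dynamic teacher.
    student = (cross_entropy(s_logits, labels)
               + alpha * kd_loss(s_logits, t_logits)
               + beta * kd_loss(s_logits, d_logits))
    # Dynamic teacher: labels + static teacher only.
    dynamic = cross_entropy(d_logits, labels) + alpha * kd_loss(d_logits, t_logits)
    return student, dynamic

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.array([0, 1, 2, 3])
    s, d, t = (rng.standard_normal((4, 5)) for _ in range(3))
    print(gpd_style_losses(s, d, t, labels))
```

In the no-pre-trained-teacher settings mentioned above, the static-teacher terms would simply be dropped, leaving the dynamic teacher as the only guide.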

Paper Structure

This paper contains 25 sections, 8 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed Gap Preserving Distillation (GPD) method. Besides the static teacher, we introduce an additional dynamic teacher and train it from scratch along with the student. The student model shares parameters with the dynamic teacher via Inverse Reparameterization (IR) and Channel-Branch Reparameterization (CBR). (a) The dynamic teacher is constructed from the student model through IR (top right). For any layer, we replicate the weights along the channel dimension to build a wider layer and introduce additional branches to construct a multi-branch architecture. To maintain the same accuracy as the student, we activate only the first branch, which contains the original student weights, and zero out all the extra branches, i.e., one-scaling and zero-scaling. (b) We extract a promising student from the dynamic teacher via CBR. The expanded multi-branch architecture can be merged into the student's single-branch architecture in a way similar to that proposed by OREPA [DBLP:journals/corr/abs-2204-00826]. After that, given an expansion ratio $r$, we directly extract the first $1/r$ of the parameters, multiplied by a scaling factor (see details in Section \ref{sec:channel_level_reparam}).
  • Figure 2: Illustration of channel-level inverse reparameterization with an expansion ratio of 2. (a) For the first layer, weights are scaled by 2 and replicated along the output channel dimension, expanding from $C_1 \times C_0$ to $2C_1 \times C_0$. For intermediate layers, weights are scaled by 2, then replicated along both input and output dimensions, expanding from $C_l \times C_{l-1}$ to $2C_l \times 2C_{l-1}$. For the last layer, weights are replicated along the input dimension, expanding from $C_L \times C_{L-1}$ to $C_L \times 2C_{L-1}$. (b) Inverse re-parameterizing the student model (left) to construct the dynamic teacher model (right) by expanding channels from 2 to 4 following the procedures exemplified in (a), while preserving the initial input-output mapping.
  • Figure 3: Illustration of the forward process for the student and dynamic teacher models. The dynamic teacher performs a direct forward pass, utilizing its increased capacity. The student model shares all parameters from the dynamic teacher and undergoes a two-step reparameterization process. First, channel-level reparameterization adjusts the expanded channels to match the original channel dimensions of the student model. Second, branch-level reparameterization merges the expanded multi-branch units into a single branch structure, thereby restoring the original topology of the student model while inheriting knowledge from the dynamic teacher.
  • Figure 4: Impact of branch expansion number and channel expansion ratio on model accuracy. Performance gains are shown above the baseline (DKD). Left: Increasing $M$ to 6 yields significant improvements, with diminishing returns beyond that. Right: Channel expansion ratios of 2 and 3 show substantial gains, while a ratio of 4 leads to degradation.
  • Figure 5: Illustration of parameter sharing for batch normalization in the student and dynamic teacher models. We maintain a separate set of running statistics for each model because their feature distributions differ.
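
The channel-level expansion in Figure 2 and the extraction step in Figure 1(b) can be checked numerically on a toy network. The sketch below is an illustration under stated assumptions, not the paper's implementation: it uses a bias-free ReLU MLP instead of a CNN, reads "scaled by 2" as division by the expansion ratio $r$ (the reading under which the input-output mapping is preserved), and takes the extraction scaling factor to be $r$ so that it exactly inverts the expansion at initialization. The function names are hypothetical.

```python
import numpy as np

def forward(weights, x):
    """Bias-free MLP with ReLU after every layer except the last."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(w @ h, 0.0)
    return weights[-1] @ h

def expand_channels(weights, r=2):
    """Widen every hidden layer by a factor r (expects at least two layers)."""
    last = len(weights) - 1
    expanded = []
    for i, w in enumerate(weights):
        if i == 0:
            # First layer: divide by r, replicate along output channels.
            expanded.append(np.tile(w / r, (r, 1)))
        elif i < last:
            # Intermediate layers: divide by r, replicate along both dimensions.
            expanded.append(np.tile(w / r, (r, r)))
        else:
            # Last layer: replicate along input channels, no scaling.
            expanded.append(np.tile(w, (1, r)))
    return expanded

def extract_student(expanded, r=2):
    """Keep the first 1/r of the widened channels and undo the scaling."""
    last = len(expanded) - 1
    student = []
    for i, w in enumerate(expanded):
        rows = w.shape[0] // r if i < last else w.shape[0]  # outputs widened except last layer
        cols = w.shape[1] // r if i > 0 else w.shape[1]     # inputs widened except first layer
        scale = r if i < last else 1                        # assumed scaling factor
        student.append(w[:rows, :cols] * scale)
    return student

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student = [rng.standard_normal(s) for s in [(4, 3), (5, 4), (2, 5)]]
    teacher = expand_channels(student, r=2)
    x = rng.standard_normal(3)
    print("max output difference:", np.abs(forward(student, x) - forward(teacher, x)).max())
```

The mapping is preserved because ReLU is positively homogeneous: each replicated block carries the student activation divided by $r$, and the $r$ blocks sum back to the original value in the next layer. After training, the replicated blocks no longer coincide, so the extracted student differs from the initial one.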