Discriminative and Consistent Representation Distillation

Nikolaos Giakoumoglou, Tania Stathaki

TL;DR

Knowledge Distillation (KD) transfers knowledge from a large teacher to a smaller student, yet prior contrastive KD methods emphasize discrimination while neglecting the teacher–student structural relationships. Discriminative and Consistent Distillation (DCD) jointly optimizes a discriminative instance-level contrastive objective and a consistency regularization to align the teacher and student distributions, enhanced by memory-free in-batch negatives and learnable temperature and bias. The final objective combines supervised loss, KD, and the discriminative-consistent term, with $\mathcal{L}_{\text{kd}} = \mathcal{L}_{\text{contrast}} + \alpha \mathcal{L}_{\text{consist}}$ and $\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda \mathcal{L}_{\text{distill}} + \beta \mathcal{L}_{\text{kd}}$, enabling dynamic balancing during training. Empirically, DCD achieves state-of-the-art results on CIFAR-100 and ImageNet, improves transferability to Tiny ImageNet and STL-10, and reduces memory overhead compared to memory-bank-based methods, demonstrating strong generalization across architectures and datasets.

Abstract

Knowledge Distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. While contrastive learning has shown promise in self-supervised learning by creating discriminative representations, its application in knowledge distillation remains limited and focuses primarily on discrimination, neglecting the structural relationships captured by the teacher model. To address this limitation, we propose Discriminative and Consistent Distillation (DCD), which employs a contrastive loss along with a consistency regularization to minimize the discrepancy between the distributions of teacher and student representations. Our method introduces learnable temperature and bias parameters that adapt during training to balance these complementary objectives, replacing the fixed hyperparameters commonly used in contrastive learning approaches. Through extensive experiments on CIFAR-100 and ImageNet ILSVRC-2012, we demonstrate that DCD achieves state-of-the-art performance, with the student model sometimes surpassing the teacher's accuracy. Furthermore, we show that DCD's learned representations exhibit superior cross-dataset generalization when transferred to Tiny ImageNet and STL-10.
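
The two objectives described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the InfoNCE-style form of the contrastive term, the choice of within-batch similarity distributions for the KL term, the function names, and the toy feature values are all assumptions. The temperature `tau` and bias `bias` are passed as plain numbers here, whereas the method learns them during training. Note that in this softmax-based sketch a constant bias shifts every logit equally and therefore cancels; the paper's own formulation may use the bias differently.

```python
import math

def l2_normalize(v):
    # Scale a feature vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def similarity_logits(a, b, tau, bias):
    # logits[i][j] = cos(a_i, b_j) / tau + bias
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, v)) / tau + bias for v in b] for u in a]

def contrastive_loss(student, teacher, tau, bias):
    # Assumed InfoNCE-style form: the positive for student i is teacher i;
    # the other teacher features in the batch act as negatives (no memory bank).
    logits = similarity_logits(student, teacher, tau, bias)
    return -sum(log_softmax(row)[i] for i, row in enumerate(logits)) / len(logits)

def consistency_loss(student, teacher, tau, bias):
    # Assumed form: KL(teacher || student) between row-wise softmax
    # distributions over within-batch similarities.
    t_logp = [log_softmax(r) for r in similarity_logits(teacher, teacher, tau, bias)]
    s_logp = [log_softmax(r) for r in similarity_logits(student, student, tau, bias)]
    kl = 0.0
    for t_row, s_row in zip(t_logp, s_logp):
        kl += sum(math.exp(t) * (t - s) for t, s in zip(t_row, s_row))
    return kl / len(t_logp)

def total_loss(l_sup, l_distill, l_contrast, l_consist, alpha, beta, lam):
    # L_kd = L_contrast + alpha * L_consist
    # L    = L_sup + lambda * L_distill + beta * L_kd
    l_kd = l_contrast + alpha * l_consist
    return l_sup + lam * l_distill + beta * l_kd

# Toy batch of three 2-D features (hypothetical values, for illustration only).
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
student = [[0.9, 0.1], [0.2, 0.8], [-0.7, 0.3]]

l_con = contrastive_loss(student, teacher, tau=0.07, bias=0.0)
l_cons = consistency_loss(student, teacher, tau=0.07, bias=0.0)
# l_sup and l_distill are placeholder numbers; beta and lam are assumed values.
loss = total_loss(l_sup=1.0, l_distill=0.5, l_contrast=l_con, l_consist=l_cons,
                  alpha=0.5, beta=1.0, lam=1.0)
print(l_con, l_cons, loss)
```

Setting `alpha=0` in `total_loss` recovers the purely discriminative variant, while a student whose features match the teacher's drives the consistency term to zero.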

Paper Structure (36 sections, 8 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of DCD. (a) Discriminative learning through contrastive distillation encourages student features (solid blue) to differentiate between instances by pulling them closer to their corresponding teacher features (transparent blue) while pushing away from other instances as negative samples (black dots). (b) Structural consistency through consistency regularization preserves the distributional relationship patterns captured by the teacher model by aligning the student and teacher feature similarities (represented by dotted lines) through KL divergence minimization.
  • Figure 2: Correlation matrix of the average logit difference between teacher and student logits on CIFAR-100. We use WRN-40-2 as the teacher and WRN-40-1 as the student. Methods have been re-implemented according to tian2022crd.
  • Figure 3: t-SNE visualizations of embeddings from teacher and student networks on CIFAR-100 (first 20 classes). We use WRN-40-2 as the teacher and WRN-40-1 as the student. Methods have been re-implemented according to tian2022crd.
  • Figure 4: Ablation study results on CIFAR-100. We show results for discriminative training ($\alpha=0$), discriminative and consistent training ($\alpha=0.5$, $\tau=0.07$, $b=0$), and our proposed DCD approach ($\alpha=0.5$, trainable $\tau$ and $b$). The colors correspond to each respective variant. (a) compares DCD variants without knowledge distillation, while (b) shows improvements when combined with KD. Results are based on a single run.
  • Figure 5: Ablation study results on CIFAR-100 using WRN-40-2 as the teacher and WRN-16-2 as the student. (a) Effect of the internal DCD coefficient $\alpha$ in \ref{eq:internaleq}. (b) Effect of the DCD loss coefficient $\beta$ in \ref{eq:finalloss}. (c) Effect of the loss coefficient $\lambda$ in \ref{eq:finalloss}. Results are averaged over five runs.