CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective

Wencheng Zhu; Xin Zhou; Pengfei Zhu; Yu Wang; Qinghua Hu

TL;DR

This work tackles the limitations of traditional knowledge distillation and contrastive KD by proposing CKD, a sample-wise contrastive framework that enforces intra-sample logit alignment and inter-sample semantic separation using a fixed teacher. It derives a concise, InfoNCE-like loss with positive pairs $(\boldsymbol{t}_i,\boldsymbol{s}_i)$ and negatives $(\boldsymbol{s}_i,\boldsymbol{s}_j)$, avoiding reliance on large batches or temperature settings. Across CIFAR-100, ImageNet-1K, Places365, and MS COCO, CKD demonstrates consistent improvements over vanilla KD and competitive results against state-of-the-art methods, with notable gains in heterogeneous teacher-student settings. The approach also shows favorable training efficiency and robust performance when combined with feature-based distillation, suggesting practical applicability to large-scale, cross-architecture knowledge transfer.

Abstract

In this paper, we propose a simple yet effective contrastive knowledge distillation framework that achieves sample-wise logit alignment while preserving semantic consistency. Conventional knowledge distillation approaches over-rely on per-sample feature similarity, which risks overfitting, whereas contrastive approaches focus on inter-class discrimination at the expense of intra-sample semantic relationships. Our approach transfers "dark knowledge" through teacher-student contrastive alignment at the sample level. Specifically, our method first enforces intra-sample alignment by directly minimizing teacher-student logit discrepancies within individual samples. Then, we utilize inter-sample contrasts to preserve semantic dissimilarities across samples. By redefining positive pairs as aligned teacher-student logits from identical samples and negative pairs as cross-sample logit combinations, we reformulate these dual constraints into an InfoNCE loss framework, reducing computational complexity to below quadratic in the number of samples while eliminating dependencies on temperature parameters and large batch sizes. We conduct comprehensive experiments on three benchmark datasets (CIFAR-100, ImageNet-1K, and MS COCO), and the results confirm the effectiveness of the proposed method on image classification, object detection, and instance segmentation tasks.
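The pairing described above can be illustrated with a minimal NumPy sketch. It is not the paper's exact loss: the similarity score (negative squared Euclidean distance between logits) and the function name `ckd_style_loss` are assumptions made for illustration. The sketch only shows how positives $(\boldsymbol{t}_i,\boldsymbol{s}_i)$ and in-batch negatives $(\boldsymbol{s}_i,\boldsymbol{s}_j)$, $j \neq i$, could enter an InfoNCE-style objective without a temperature parameter. It scores all in-batch pairs directly and does not attempt to reproduce the complexity reduction claimed in the abstract.

```python
import numpy as np


def ckd_style_loss(teacher_logits, student_logits):
    """InfoNCE-style sample-wise loss (illustrative; hypothetical helper).

    Assumption: similarity is the negative squared Euclidean distance
    between logit vectors. The paper's actual score may differ.
    """
    t = np.asarray(teacher_logits, dtype=np.float64)  # (n, d), fixed teacher
    s = np.asarray(student_logits, dtype=np.float64)  # (n, d), student

    # Positive score for sample i: closeness of s_i to its own teacher logit t_i.
    pos = -np.sum((t - s) ** 2, axis=1)               # (n,)

    # Negative scores for sample i: closeness of s_i to other students s_j.
    diff = s[:, None, :] - s[None, :, :]              # (n, n, d)
    neg = -np.sum(diff ** 2, axis=2)                  # (n, n)
    np.fill_diagonal(neg, -np.inf)                    # exclude j == i

    # Numerically stable log-sum-exp over [positive, negatives] per sample.
    scores = np.concatenate([pos[:, None], neg], axis=1)  # (n, n + 1)
    m = scores.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))

    # Cross-entropy with the positive pair as the target class.
    return float(np.mean(lse - pos))


if __name__ == "__main__":
    teacher = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]])
    print(ckd_style_loss(teacher, teacher.copy()))         # aligned student
    print(ckd_style_loss(teacher, np.zeros_like(teacher)))  # collapsed student
```

Under these assumptions, a student that matches the teacher on well-separated samples incurs a loss close to zero, while a student whose logits collapse to a single point is penalized both for drifting from its teacher and for failing to separate samples.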

Paper Structure (36 sections, 3 theorems, 20 equations, 10 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Knowledge Distillation
Logit Distillation
Feature Distillation
Contrastive Learning
Approach
Notations
Contrastive Knowledge Distillation
Intra-Sample Distillation
Inter-Sample Distillation
Contrastive Formulation
Experiments
Image Classification
Datasets
...and 21 more sections

Key Result

Theorem 1

Given triples $\left(\boldsymbol{t}_i,\boldsymbol{s}_i,\boldsymbol{s}_j\right)$, the loss function $\mathcal{L}_i$ is defined as [...]. The gradients $\nabla_{\boldsymbol{s}_i}\mathcal{L}_i$ and $\nabla_{\boldsymbol{s}_j}\mathcal{L}_i$ are proportional to $g_i$, which takes the form [...].

Figures (10)

Figure 1: Comparison of three knowledge distillation approaches. Classic KD aligns feature similarities between paired teacher-student samples, while CRD leverages contrastive learning to align class-wise semantics. In contrast, the proposed CKD preserves sample-wise similarity for intra-sample alignment while capturing structural semantics for inter-sample contrast.
Figure 2: The overall architecture of the proposed Contrastive Knowledge Distillation framework. The framework incorporates newly designed triplets to optimize intra-sample feature similarity and cross-sample semantics simultaneously. For each input instance, the teacher and student logits form positive pairs, while other samples within the mini-batch serve as negative pairs. This formulation significantly enhances training efficiency through computationally efficient negative sample reuse, effectively addressing the memory constraints typically associated with contrastive learning approaches.
Figure 3: Visualization of the intra-sample alignment. For the sake of simplicity, we assume that the dimension of logits is 3. The spherical surface represents $\mathbf{s}_i$ satisfying the condition $\epsilon_i=\mathbf{t}_i-\mathbf{s}_i$ for the fixed $\epsilon_i$.
Figure 4: Visualization of two categories of triples. (a) depicts $\left(\boldsymbol{s}_i,\boldsymbol{t}_i,\boldsymbol{s}_j\right)$, while (b) and (c) show $\left(\boldsymbol{s}_i,\boldsymbol{t}_i,\boldsymbol{t}_j\right)$. The blue line indicates that $\boldsymbol{s}_i$ approaches $\boldsymbol{t}_i$, and the red line represents that $\boldsymbol{s}_i$ is far from $\boldsymbol{s}_j$ or $\boldsymbol{t}_j$.
Figure 5: Pseudo code of CKD in a Numpy-like style.
...and 5 more figures

Theorems & Definitions (6)

Theorem 1
proof
Theorem 2
proof
Theorem 3
proof
