CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective
Wencheng Zhu, Xin Zhou, Pengfei Zhu, Yu Wang, Qinghua Hu
TL;DR
This work tackles the limitations of traditional knowledge distillation and contrastive KD by proposing CKD, a sample-wise contrastive framework that enforces intra-sample logit alignment and inter-sample semantic separation using a fixed teacher. It derives a concise, InfoNCE-like loss with positive pairs $(\boldsymbol{t}_i,\boldsymbol{s}_i)$ and negatives $(\boldsymbol{s}_i,\boldsymbol{s}_j)$, avoiding reliance on large batches or temperature settings. Across CIFAR-100, ImageNet-1K, Places365, and MS COCO, CKD demonstrates consistent improvements over vanilla KD and competitive results against state-of-the-art methods, with notable gains in heterogeneous teacher-student settings. The approach also shows favorable training efficiency and robust performance when combined with feature-based distillation, suggesting practical applicability to large-scale, cross-architecture knowledge transfer.
Abstract
In this paper, we propose a simple yet effective contrastive knowledge distillation framework that achieves sample-wise logit alignment while preserving semantic consistency. Conventional knowledge distillation approaches rely too heavily on per-sample feature similarity, which risks overfitting, while contrastive approaches focus on inter-class discrimination at the expense of intra-sample semantic relationships. Our approach transfers "dark knowledge" through teacher-student contrastive alignment at the sample level. Specifically, our method first enforces intra-sample alignment by directly minimizing the discrepancy between teacher and student logits of the same sample. It then uses inter-sample contrasts to preserve semantic dissimilarities across samples. By defining positive pairs as teacher-student logits from the same sample and negative pairs as logit combinations across different samples, we reformulate these two constraints as an InfoNCE loss, which keeps the computational complexity below quadratic in the number of samples and removes the dependence on temperature parameters and large batch sizes. We conduct comprehensive experiments on three benchmark datasets (CIFAR-100, ImageNet-1K, and MS COCO), and the results confirm the effectiveness of the proposed method on image classification, object detection, and instance segmentation tasks.
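As an illustration of the pairing scheme described above, the following is a minimal sketch of an InfoNCE-style loss with positives $(\boldsymbol{t}_i,\boldsymbol{s}_i)$ and negatives $(\boldsymbol{s}_i,\boldsymbol{s}_j)$, $j \neq i$. It is not the authors' implementation: the similarity function (negative squared Euclidean distance), the batch-mean reduction, and all function names are assumptions made for this example, and the exact loss in the paper may differ. No temperature parameter is used, in keeping with the abstract.

```python
import math


def neg_sq_dist(u, v):
    """Similarity assumed for this sketch: negative squared Euclidean distance."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def samplewise_contrastive_loss(teacher_logits, student_logits):
    """InfoNCE-style loss (hypothetical helper, not from the paper).

    Positive pair for sample i:  (teacher_logits[i], student_logits[i]).
    Negative pairs for sample i: (student_logits[i], student_logits[j]), j != i.
    The teacher is treated as fixed; only the student would receive gradients.
    """
    n = len(student_logits)
    total = 0.0
    for i in range(n):
        pos = neg_sq_dist(teacher_logits[i], student_logits[i])
        negs = [
            neg_sq_dist(student_logits[i], student_logits[j])
            for j in range(n)
            if j != i
        ]
        # -log( exp(pos) / (exp(pos) + sum_j exp(neg_j)) )
        total += logsumexp([pos] + negs) - pos
    return total / n
```

Under these assumptions the loss is close to zero when each student logit vector matches its own teacher vector and lies far from the other students' vectors, and it grows when the student collapses all samples to the same output or aligns with the wrong sample.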
