Contrastive Representation Distillation via Multi-Scale Feature Decoupling
Cuipeng Wang, Haipeng Wang
TL;DR
Knowledge distillation often relies on global feature alignment, which neglects the fine-grained information carried by local regions within a single feature and can hinder transfer, especially between heterogeneous architectures. Many contrastive distillation methods also depend on large memory buffers. MSDCRD is a model-agnostic framework that decouples each global feature into multi-scale local samples via sliding pooling and sample selection, then trains with two tailored contrastive losses (sample-wise and feature-wise) computed within a single batch, which removes the need for a memory buffer. The approach reports state-of-the-art results in homogeneous teacher-student settings and strong cross-architecture transfer across CNNs, ViTs, and MLPs, supported by ablations and by visualizations showing close teacher-student alignment. Because it is memory-efficient and plug-and-play, the method is well suited to training compact models for resource-constrained devices while preserving generalization across vision tasks.
Abstract
Knowledge distillation enhances the performance of compact student networks by transferring knowledge from more powerful teacher networks without introducing additional parameters. In the feature space, local regions within an individual global feature encode distinct yet interdependent semantic information. Previous feature-based distillation methods mainly emphasize global feature alignment while neglecting the decoupling of local regions within an individual global feature, which often results in semantic confusion and suboptimal performance. Moreover, conventional contrastive representation distillation suffers from low efficiency due to its reliance on a large memory buffer to store feature samples. To address these limitations, this work proposes MSDCRD, a model-agnostic distillation framework that systematically decouples global features into multi-scale local features and leverages the resulting semantically rich feature samples with tailored sample-wise and feature-wise contrastive losses. This design enables efficient distillation using only a single batch, eliminating the dependence on external memory. Extensive experiments demonstrate that MSDCRD achieves superior performance not only in homogeneous teacher-student settings but also in heterogeneous architectures where feature discrepancies are more pronounced, highlighting its strong generalization capability.
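The two ideas in the abstract, multi-scale decoupling of a global feature and contrastive learning within a single batch, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, window sizes, stride choice, temperature, and the use of a generic InfoNCE loss are all assumptions made for illustration, and the sample-selection step and the feature-wise loss are omitted. The sketch also assumes that student and teacher features already share the same channel dimension (for example, after a projection layer).

```python
import numpy as np


def sliding_avg_pool(feat, k, stride):
    """Average-pool every k x k window of a (B, C, H, W) feature map.

    Returns an array of shape (B, N, C), one C-dim vector per window.
    """
    _, _, h, w = feat.shape
    windows = []
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            windows.append(feat[:, :, i:i + k, j:j + k].mean(axis=(2, 3)))
    return np.stack(windows, axis=1)


def decouple_multiscale(feat, window_sizes=(1, 2, 4)):
    """Hypothetical decoupling: pool at several scales (stride = window size)
    and concatenate the resulting local samples along the sample axis."""
    scales = [sliding_avg_pool(feat, k, stride=k) for k in window_sizes]
    return np.concatenate(scales, axis=1)  # (B, N_total, C)


def info_nce(student, teacher, tau=0.1, eps=1e-12):
    """Generic in-batch InfoNCE (a stand-in, not the paper's exact loss).

    Row i of `student` is pulled toward row i of `teacher`; all other
    teacher rows in the batch act as negatives, so no memory buffer is used.
    """
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    logits = s @ t.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_feat = rng.normal(size=(2, 8, 4, 4))            # (B, C, H, W)
    student_feat = teacher_feat + 0.1 * rng.normal(size=teacher_feat.shape)
    t_samples = decouple_multiscale(teacher_feat).reshape(-1, 8)
    s_samples = decouple_multiscale(student_feat).reshape(-1, 8)
    print("in-batch contrastive loss:", info_nce(s_samples, t_samples))
```

In this sketch a 4 x 4 feature map with window sizes 1, 2, and 4 yields 16 + 4 + 1 = 21 local samples per image, so even a small batch supplies many positives and negatives without an external memory bank.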
