Contrastive Representation Distillation via Multi-Scale Feature Decoupling

Cuipeng Wang, Haipeng Wang

TL;DR

Knowledge distillation often relies on global feature alignment, which neglects fine-grained information in the local regions of a feature and can hinder transfer, especially across heterogeneous models. Many contrastive methods additionally depend on large memory buffers. MSDCRD is a model-agnostic framework that decouples a global feature into multi-scale local samples via sliding pooling and sample selection, then trains with two tailored contrastive losses within a single batch, removing the need for a memory buffer. The authors report state-of-the-art results in homogeneous settings and strong cross-architecture transfer across CNNs, ViTs, and MLPs, supported by ablations and visualizations that show close teacher–student alignment. Being memory-efficient and plug-and-play, the method is aimed at deploying compact, high-performing models on resource-constrained devices while preserving generalization across vision tasks.

Abstract

Knowledge distillation enhances the performance of compact student networks by transferring knowledge from more powerful teacher networks without introducing additional parameters. In the feature space, local regions within an individual global feature encode distinct yet interdependent semantic information. Previous feature-based distillation methods mainly emphasize global feature alignment while neglecting the decoupling of local regions within an individual global feature, which often results in semantic confusion and suboptimal performance. Moreover, conventional contrastive representation distillation suffers from low efficiency due to its reliance on a large memory buffer to store feature samples. To address these limitations, this work proposes MSDCRD, a model-agnostic distillation framework that systematically decouples global features into multi-scale local features and leverages the resulting semantically rich feature samples with tailored sample-wise and feature-wise contrastive losses. This design enables efficient distillation using only a single batch, eliminating the dependence on external memory. Extensive experiments demonstrate that MSDCRD achieves superior performance not only in homogeneous teacher-student settings but also in heterogeneous architectures where feature discrepancies are more pronounced, highlighting its strong generalization capability.
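
To make the two central ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single feature map of shape (C, H, W), average pooling over sliding windows, and a standard in-batch InfoNCE loss in which same-index student and teacher rows are positives. The function names `multi_scale_pool` and `in_batch_info_nce`, the kernel sizes, the stride, and the temperature are all hypothetical choices for illustration. The sample-selection step and the paper's specific sample-wise and feature-wise losses are not reproduced, since the abstract does not define them.

```python
import numpy as np


def multi_scale_pool(feat, kernel_sizes=(1, 2, 4), stride=None):
    """Split one feature map of shape (C, H, W) into local feature vectors.

    Returns an array of shape (N, C), one row per pooled window.
    kernel_sizes and stride are illustrative, not values from the paper.
    """
    _, h, w = feat.shape
    samples = []
    for k in kernel_sizes:
        s = stride or k  # non-overlapping windows unless a stride is given
        for top in range(0, h - k + 1, s):
            for left in range(0, w - k + 1, s):
                window = feat[:, top:top + k, left:left + k]
                samples.append(window.mean(axis=(1, 2)))  # average pooling
    return np.stack(samples)


def in_batch_info_nce(student, teacher, temperature=0.1):
    """InfoNCE computed from the current samples only (no memory buffer).

    Row i of `student` is the positive for row i of `teacher`;
    all other teacher rows act as negatives.
    """
    eps = 1e-12
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    logits = s @ t.T / temperature                    # (N, N) scaled cosine similarity
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy usage: the student map is a noisy copy of the teacher map.
rng = np.random.default_rng(0)
teacher_map = rng.normal(size=(8, 4, 4))
student_map = teacher_map + 0.1 * rng.normal(size=(8, 4, 4))

t_samples = multi_scale_pool(teacher_map)  # 16 + 4 + 1 = 21 local samples
s_samples = multi_scale_pool(student_map)
loss = in_batch_info_nce(s_samples, t_samples)
print(t_samples.shape, loss)
```

With a 4×4 map and kernel sizes 1, 2, and 4, the largest window covers the whole map, so the last sample equals the usual globally pooled feature and the smaller windows add the local views. In a real teacher-student pair the channel counts usually differ, so a projection layer would be needed before the loss; that step is omitted here.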

Paper Structure

This paper contains 13 sections, 21 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Different Local Regions in the CAMs Focus on Distinct Category Information. Different colors denote different classes.
  • Figure 2: Illustration of Previous Distillation Methods and the Proposed MSDCRD. (a) Logit-based distillation, which minimizes the KL divergence between the student and teacher logits. (b) Previous feature-based distillation, which aligns the global feature representations of the student and teacher networks. (c) The proposed MSDCRD method, which performs multi-scale decoupling on an individual global feature and integrates the resulting features with two efficient sample-wise and feature-wise contrastive losses.
  • Figure 3: Similarity Heatmap of Intermediate Features in Homogeneous vs. Heterogeneous Models Measured by CKA. This work compares features from ConvNeXt (CNN), ViT (Transformer), and Mixer (MLP).
  • Figure 4: Visualization of the Multi-Scale Decoupling Process. (a) CAM visualizations. A single image is fed into the teacher network, and the top-3 class activation maps (CAMs) are visualized. (b) Multi-scale decoupling. Multi-scale pooling is equivalently achieved by partitioning a single input into multiple local regions and extracting their features separately, followed by sample selection to further process the resulting feature samples.
  • Figure 5: Maximum Softmax Probability Distributions on ImageNet with and without Multi-Scale Pooling (MSP). The first column shows the maximum softmax probability distributions of samples from different teacher networks without MSP, while the second column shows the corresponding distributions with MSP.
  • ...and 1 more figure