Intra-class Patch Swap for Self-Distillation

Hongjun Choi, Eun Som Jeon, Ankita Shukla, Pavan Turaga

TL;DR

The paper tackles the challenge of distilling knowledge without a fixed, pre-trained teacher by proposing a teacher-free self-distillation framework built on intra-class patch swap augmentation. This augmentation creates pairings within the same class and uses instance-to-instance distillation to align predictive distributions, all within a single network and without architectural changes. Across image classification, semantic segmentation, and object detection, the approach yields consistent gains over both self-distillation baselines and conventional KD, demonstrating improved predictive quality and robustness. The work argues that augmentation design, specifically intra-class patch manipulation, is a key driver of successful self-distillation with practical implications for efficient model compression and deployment.

Abstract

Knowledge distillation (KD) is a valuable technique for compressing large deep learning models into smaller, edge-suitable networks. However, conventional KD frameworks rely on pre-trained high-capacity teacher networks, which introduce significant challenges such as increased memory/storage requirements, additional training costs, and ambiguity in selecting an appropriate teacher for a given student model. Although teacher-free distillation (self-distillation) has emerged as a promising alternative, many existing approaches still rely on architectural modifications or complex training procedures, which limit their generality and efficiency. To address these limitations, we propose a novel framework based on teacher-free distillation that operates using a single student network without any auxiliary components, architectural modifications, or additional learnable parameters. Our approach is built on a simple yet highly effective augmentation, called intra-class patch swap augmentation. This augmentation simulates a teacher-student dynamic within a single model by generating pairs of intra-class samples with varying confidence levels, and then applying instance-to-instance distillation to align their predictive distributions. Our method is conceptually simple, model-agnostic, and easy to implement, requiring only a single augmentation function. Extensive experiments across image classification, semantic segmentation, and object detection show that our method consistently outperforms both existing self-distillation baselines and conventional teacher-based KD approaches. These results suggest that the success of self-distillation could hinge on the design of the augmentation itself. Our code is available at https://github.com/hchoi71/Intra-class-Patch-Swap.
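
The two ingredients named in the abstract, swapping patches between two images of the same class and then aligning the predictive distributions of the swapped pair, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names, the fixed grid of square patches, the per-patch swap probability `swap_prob`, the temperature, and the choice of a symmetric KL divergence as the alignment loss are all assumptions made for illustration.

```python
import numpy as np


def intra_class_patch_swap(img_a, img_b, patch=8, swap_prob=0.5, rng=None):
    """Swap randomly chosen grid patches between two same-class images.

    Hypothetical helper: images are (H, W, C) arrays of equal shape, and
    the grid size and swap probability are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    if img_a.shape != img_b.shape:
        raise ValueError("both images must have the same shape")
    h, w = img_a.shape[:2]
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")

    out_a, out_b = img_a.copy(), img_b.copy()
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Each grid cell is exchanged independently with prob. swap_prob.
            if rng.random() < swap_prob:
                rows = slice(top, top + patch)
                cols = slice(left, left + patch)
                out_a[rows, cols] = img_b[rows, cols]
                out_b[rows, cols] = img_a[rows, cols]
    return out_a, out_b


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def symmetric_kl(logits_a, logits_b, temperature=4.0, eps=1e-12):
    """Mean of KL(p||q) + KL(q||p) between two softened predictions.

    Assumption: a symmetric KL term stands in for the paper's
    instance-to-instance distillation loss, whose exact form is not
    given in this summary.
    """
    p = softmax(logits_a, temperature)
    q = softmax(logits_b, temperature)
    log_ratio = np.log(p + eps) - np.log(q + eps)
    kl_pq = np.sum(p * log_ratio, axis=-1)
    kl_qp = np.sum(q * -log_ratio, axis=-1)
    return float(np.mean(kl_pq + kl_qp))


# Usage: two random "images" standing in for a same-class pair.
rng = np.random.default_rng(0)
image_a = rng.random((32, 32, 3))
image_b = rng.random((32, 32, 3))
swapped_a, swapped_b = intra_class_patch_swap(image_a, image_b, rng=rng)
```

In a training loop, both swapped images would be passed through the same network, and a loss of this kind on the two sets of logits would be added to the usual cross-entropy on the hard label. How the terms are weighted is not specified here.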

Paper Structure

This paper contains 21 sections, 3 equations, 10 figures, 15 tables, 1 algorithm.

Figures (10)

  • Figure 1: Three different distillation mechanisms. Teacher-to-Student and Student-to-Student require multiple networks, while self-distillation (ours) needs only a single network for training.
  • Figure 2: Top: Illustrations of design choices when training a single network (ResNet-18) on CIFAR100. Instead of matching the function directly (Hard Label, CutMix), our method matches the outputs (called relaxed knowledge) of two swapped inputs while still using the cross-entropy loss with the hard label. Bottom: Averaged top-1 probabilities of the target label on the test data. The blue and orange graphs indicate the averaged top-1 probability of correctly classified samples and misclassified samples, respectively. Compared to other methods, our method guarantees a high quality of predictive distribution, judging by the fact that even mispredictions still belong to the people class.
  • Figure 3: Overall framework of the proposed method. The process of generating new training inputs by exchanging patches between positive sample pairs creates a strong teaching signal, leading to high prediction confidence in the target class (Relaxed Knowledge 1) and low prediction confidence in the same target for the image that lost the strong signal (Relaxed Knowledge 2). By matching these outputs, the network learns all relevant parts of the object, as demonstrated by a well-matched graph in the final figure.
  • Figure 4: Magnitude of gradients in each layer from ResNet18. Patch swap, as used in our method, can indirectly alleviate the vanishing gradient problem that may arise during training.
  • Figure 5: The averaged L1 norm of the gradient. Patch swap keeps relatively high values after 150 epochs, which suggests that the model continues to learn throughout the training process.
  • ...and 5 more figures
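
The quantity plotted in the bottom panel of Figure 2 can be sketched as follows. This assumes that "top-1 probability of the target label" means the softmax probability the model assigns to the ground-truth class, averaged separately over correctly classified and misclassified test samples. That reading, and the function name `mean_target_prob_by_outcome`, are assumptions rather than details confirmed by the caption.

```python
import numpy as np


def mean_target_prob_by_outcome(probs, labels):
    """Average ground-truth-class probability for correct vs. wrong predictions.

    probs:  (N, num_classes) array of predicted class probabilities.
    labels: (N,) array of integer ground-truth class indices.
    Returns (mean_over_correct, mean_over_misclassified); a group with
    no samples yields NaN.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predictions = probs.argmax(axis=1)
    # Probability assigned to the true class of each sample.
    target_prob = probs[np.arange(len(labels)), labels]
    correct = predictions == labels

    mean_correct = float(target_prob[correct].mean()) if correct.any() else float("nan")
    mean_wrong = float(target_prob[~correct].mean()) if (~correct).any() else float("nan")
    return mean_correct, mean_wrong


# Usage with three toy samples: the first is classified correctly,
# the other two are not.
toy_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.6, 0.3],
             [0.5, 0.4, 0.1]]
toy_labels = [0, 2, 1]
avg_correct, avg_wrong = mean_target_prob_by_outcome(toy_probs, toy_labels)
```

Under this reading, a higher average on the misclassified group means the model still places substantial probability on the true class even when its top prediction is wrong.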