Diverse Feature Learning by Self-distillation and Reset

Sejik Park

TL;DR

Diverse Feature Learning (DFL) addresses the dual challenge of forgetting learned features and failing to acquire new ones by integrating two complementary strategies: self-distillation-based feature preservation and reset-driven exploration of new feature spaces. The method treats high-quality weights encountered during training as teachers and selects them via a meaningfulness criterion, while periodically reinitializing the student head to encourage learning new features. Empirical results on CIFAR-10 and CIFAR-100 across several lightweight architectures show that combining these components yields synergistic improvements, with ablations highlighting the importance of the number of teachers, cycle length, and the depth of the student head. The work suggests that coupling ensemble-inspired preservation with weight-space exploration can enhance feature diversity and model performance, albeit with caveats around the reliability of the meaningfulness measure and hyperparameter sensitivity.

Abstract

Our paper addresses the problem of models struggling to learn diverse features, due to either forgetting previously learned features or failing to learn new ones. To overcome this problem, we introduce Diverse Feature Learning (DFL), a method that combines an algorithm for preserving important features with an algorithm for learning new features. Specifically, to preserve important features, we apply self-distillation over an ensemble of teachers, selected from the meaningful model weights observed during training. To learn new features, we employ reset, which periodically re-initializes part of the model. Experiments with various models on image classification indicate a potential synergistic effect between self-distillation and reset.

Paper Structure (16 sections, 4 equations, 6 figures, 7 tables, 2 algorithms)

Figures (6)

  • Figure 1: Overall Training Process. Our algorithm consists of three steps: updating the student through self-distillation for feature preservation, updating the teacher based on meaningfulness for an efficient ensemble, and resetting the student for new feature learning.
  • Figure 2: Update Student. The model $\theta$ is trained with the classification loss $\mathcal{L}_{\text{main}}$ and the self-distillation loss $\mathcal{L}_{\text{distill}}$. These losses update the student $\phi_0$ and the body $\theta_{\text{body}}$, while the teacher set $\Phi = \{\phi_k\}_{k=1}^K$ is kept frozen.
  • Figure 3: Update Teacher. The teacher set $\Phi$ is updated by replacing the lowest-performing teacher $\phi_{k'}$ with the student $\phi_0$ if that teacher underperforms the student. This decision is based on the meaningfulness measurement $f$. For image classification, we measure the meaningfulness $p_k$ as the training accuracy in the most recent epoch.
  • Figure 4: Self-distillation Loss over Different Datasets. Mean and standard deviation of the self-distillation loss under the default settings of Table \ref{tbl:ablation}. Lines show the mean; shaded areas show the standard deviation.
  • Figure 5: Self-distillation Loss over Different Model Architectures. Mean and standard deviation of the self-distillation loss under the settings of Table \ref{tbl:diverse}. Lines show the mean; shaded areas show the standard deviation.
  • ...and 1 more figure
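
The teacher update described in the Figure 3 caption and the reset step from Figure 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Head` class, the function names, the flat weight lists, and the uniform re-initialization range are all hypothetical stand-ins chosen for illustration. The sketch assumes a non-empty teacher set and a strict comparison of meaningfulness scores; how the teacher set is first populated and when the reset fires are left out.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Head:
    """Hypothetical stand-in for a head phi_k with its meaningfulness p_k."""
    weights: List[float]
    meaningfulness: float = 0.0  # e.g. training accuracy in the most recent epoch


def update_teachers(teachers: List[Head], student: Head) -> List[Head]:
    """Return a new teacher list in which the lowest-scoring teacher is
    replaced by a frozen copy of the student, but only if that teacher
    scores strictly below the student (assumed comparison rule)."""
    if not teachers:
        raise ValueError("this sketch assumes a non-empty teacher set")
    updated = list(teachers)
    # Index k' of the teacher with the lowest meaningfulness p_k.
    worst = min(range(len(updated)), key=lambda k: updated[k].meaningfulness)
    if updated[worst].meaningfulness < student.meaningfulness:
        # Copy the weights so later student updates do not alter the teacher.
        updated[worst] = Head(list(student.weights), student.meaningfulness)
    return updated


def reset_head(size: int, rng: random.Random) -> Head:
    """Re-initialize the student head; the uniform range is an assumption."""
    return Head([rng.uniform(-0.1, 0.1) for _ in range(size)])


teachers = [Head([1.0], 0.6), Head([2.0], 0.8)]
student = Head([3.0], 0.7)
teachers_after = update_teachers(teachers, student)  # replaces the 0.6 teacher
student = reset_head(size=1, rng=random.Random(0))   # fresh student head
```

Returning a new list rather than mutating in place keeps the previous teacher set available for comparison, which is a convenience of this sketch rather than something the source specifies.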