Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks

Shin'ya Yamaguchi, Sekitoshi Kanai, Kazuki Adachi, Daiki Chijiwa

TL;DR

Adaptive random feature regularization (AdaRand) is a simple method that helps the feature extractor of a model being fine-tuned adapt the distribution of its feature vectors to a downstream classification task, without auxiliary source information and at reasonable computation cost.

Abstract

While fine-tuning is a de facto standard method for training deep neural networks, it still suffers from overfitting when the target dataset is small. Previous methods improve fine-tuning performance by maintaining knowledge of the source datasets or by introducing regularization terms such as a contrastive loss. However, these methods require auxiliary source information (e.g., source labels or datasets) or heavy additional computation. In this paper, we propose a simple method called adaptive random feature regularization (AdaRand). AdaRand helps the feature extractors of training models to adaptively change the distribution of feature vectors for downstream classification tasks without auxiliary source information and with reasonable computation costs. To this end, AdaRand minimizes the gap between feature vectors and random reference vectors that are sampled from class conditional Gaussian distributions. Furthermore, AdaRand dynamically updates the conditional distributions so that they follow the currently updated feature extractor and balance the distances between classes in the feature space. Our experiments show that AdaRand outperforms other fine-tuning regularization methods, which require auxiliary source information and heavy computation costs.
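The regularization term described in the abstract can be illustrated with a minimal NumPy sketch. It assumes isotropic class conditional Gaussians and a squared Euclidean gap; the function names (`sample_references`, `random_feature_penalty`) and the per-class scalar standard deviation are hypothetical choices made for illustration, not details taken from the paper.

```python
import numpy as np

def sample_references(labels, class_means, class_stds, rng):
    """Draw one reference vector per example from N(mean_y, std_y^2 * I).

    Assumption: isotropic Gaussians with one scalar std per class.
    """
    mu = class_means[labels]                # (batch, dim)
    sigma = class_stds[labels][:, None]     # (batch, 1), broadcast over dim
    return mu + sigma * rng.standard_normal(mu.shape)

def random_feature_penalty(features, labels, class_means, class_stds, rng):
    """Mean squared Euclidean gap between features and sampled references."""
    refs = sample_references(labels, class_means, class_stds, rng)
    return float(np.mean(np.sum((features - refs) ** 2, axis=1)))

# Toy data standing in for feature extractor outputs.
rng = np.random.default_rng(0)
num_classes, dim = 3, 4
class_means = rng.standard_normal((num_classes, dim))
class_stds = np.ones(num_classes)
features = rng.standard_normal((8, dim))
labels = rng.integers(0, num_classes, size=8)

penalty = random_feature_penalty(features, labels, class_means, class_stds, rng)
```

In a fine-tuning loop, a term of this kind would presumably be added to the classification loss with a weighting coefficient; that weighting is likewise an assumption here.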

Paper Structure (38 sections, 10 equations, 9 figures, 6 tables, 1 algorithm)

Figures (9)

  • Figure 1: Intuitive comparison of RandReg and AdaRand (proposed method). RandReg regularizes a feature extractor by minimizing the gap between input features and noise generated from a fixed prior distribution. Although RandReg is very simple, it tends to concentrate the features in local regions because the prior is fixed and class-agnostic, which prevents the classes from separating. In contrast, AdaRand adopts class conditional priors and dynamically updates them using the running feature statistics of the training model and by maximizing the distances between each pair of class conditional priors. This helps models obtain more separable features and improves accuracy.
  • Figure 2: Feature Norm
  • Figure 3: Gradient Norm
  • Figure 4: Feature Entropy
  • Figure 5: Mutual Information
  • ...and 4 more figures
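
The prior update described in the Figure 1 caption (tracking running feature statistics and pushing class conditional priors apart) could look roughly like the following sketch. The exact update rule is not given in this summary, so the exponential moving average, the `momentum` and `step` parameters, and the gradient-ascent step on squared pairwise distances are all assumptions introduced for illustration.

```python
import numpy as np

def update_class_means(class_means, features, labels, momentum=0.9):
    """Move each class mean toward the batch mean of that class (assumed EMA)."""
    updated = class_means.copy()
    for c in np.unique(labels):
        batch_mean = features[labels == c].mean(axis=0)
        updated[c] = momentum * class_means[c] + (1.0 - momentum) * batch_mean
    return updated

def spread_class_means(class_means, step=0.1):
    """One gradient-ascent step on the sum of squared pairwise distances.

    d/d mu_c of sum_{i<j} ||mu_i - mu_j||^2 = 2 * (k * mu_c - sum_j mu_j)
    """
    k = len(class_means)
    grad = 2.0 * (k * class_means - class_means.sum(axis=0, keepdims=True))
    return class_means + step * grad

def mean_pairwise_distance(class_means):
    """Average Euclidean distance over all ordered pairs of distinct classes."""
    diffs = class_means[:, None, :] - class_means[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    k = len(class_means)
    return dists.sum() / (k * (k - 1))

rng = np.random.default_rng(0)
class_means = rng.standard_normal((3, 4))
features = rng.standard_normal((6, 4))
labels = np.array([0, 0, 1, 1, 1, 0])   # class 2 is absent from this batch

tracked = update_class_means(class_means, features, labels)
spread = spread_class_means(tracked)
```

Classes that do not appear in a batch keep their previous mean, and the spreading step scales every class mean away from the centroid, so the average pairwise distance grows. A fixed, class-agnostic prior as in RandReg corresponds to skipping both updates.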