Variance-Covariance Regularization Improves Representation Learning

Jiachen Zhu, Katrina Evtimova, Yubei Chen, Ravid Shwartz-Ziv, Yann LeCun

TL;DR

VCReg addresses the tendency of supervised pretraining to overfit to the source task by regularizing representations to have high variance and low covariance. It extends the variance and covariance terms of the VICReg objective to supervised settings and applies them to intermediate layers, promoting diverse, transferable features, together with an efficient implementation based on backward-gradient updates. Empirical results on images and videos show state-of-the-art transfer performance and gains on long-tail and hierarchical tasks. Analyses indicate reduced gradient starvation and neural collapse, with information content and noise robustness maintained. Overall, VCReg provides a practical, architecture-agnostic framework that strengthens feature transfer and broadens the applicability of supervised pretraining.

Abstract

Transfer learning plays a key role in advancing machine learning models, yet conventional supervised pretraining often undermines feature transferability by prioritizing features that minimize the pretraining loss. In this work, we adapt a self-supervised learning regularization technique from the VICReg method to supervised learning contexts, introducing Variance-Covariance Regularization (VCReg). This adaptation encourages the network to learn high-variance, low-covariance representations, promoting learning more diverse features. We outline best practices for an efficient implementation of our framework, including applying it to the intermediate representations. Through extensive empirical evaluation, we demonstrate that our method significantly enhances transfer learning for images and videos, achieving state-of-the-art performance across numerous tasks and datasets. VCReg also improves performance in scenarios like long-tail learning and hierarchical classification. Additionally, we show its effectiveness may stem from its success in addressing challenges like gradient starvation and neural collapse. In summary, VCReg offers a universally applicable regularization framework that significantly advances transfer learning and highlights the connection between gradient starvation, neural collapse, and feature transferability.
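
To make "high-variance, low-covariance" concrete, the following is a minimal sketch of a variance-covariance penalty on a single batch of representations, written in plain Python for readability. It follows the general form of the variance and covariance terms in VICReg; the function name `vcreg_loss`, the weights `alpha` and `beta`, the target standard deviation `gamma`, the constant `eps`, and the normalization by the feature dimension are all assumptions made for illustration and may differ from the paper's exact formulation and efficient implementation.

```python
import math


def vcreg_loss(batch, alpha=1.0, beta=1.0, gamma=1.0, eps=1e-4):
    """Variance-covariance penalty for one batch of representations.

    batch: list of n feature vectors (each a list of d floats), n >= 2.
    alpha, beta, gamma, eps are illustrative defaults, not the paper's values.
    """
    n, d = len(batch), len(batch[0])
    # Center each feature dimension over the batch.
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]
    # Unbiased d x d covariance matrix of the features.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Variance term: hinge that is active when a feature's std falls below gamma.
    var_term = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d
    # Covariance term: squared off-diagonal entries, pushing features to decorrelate.
    cov_term = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d
    return alpha * var_term + beta * cov_term


# Toy batches (hypothetical data): two features, four samples each.
correlated = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
decorrelated = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
collapsed = [[0.5, 0.5]] * 4
```

In this toy setting, `decorrelated` incurs no penalty, `correlated` is penalized only through the covariance term, and `collapsed` only through the variance term. In a supervised setup such a penalty would be added to the task loss for the representations being regularized.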

Paper Structure (34 sections, 8 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: VCReg regularizes the network by encouraging the intermediate representations to have high variance and low covariance. VCReg is applied to the output of each network block to make all the intermediate representations capture diverse features.
  • Figure 2: Comparative evaluation between training with and without VCReg on a "Two-Moon" Synthetic Dataset. Decision boundaries are averaged over ten distinct runs with random data point sampling and model initialization. A single run's data points are displayed for visual clarity. The contrast between VCReg and "No regularization" underscores the latter's limitations in forming intricate decision boundaries, while highlighting VCReg's effectiveness in generating meaningful ones.
  • Figure 3: Impact of VCReg amidst noisy data: This figure shows the top-1 accuracy of VideoMAE-S and VideoMAEv2-S when fine-tuned for action recognition using HMDB51 corrupted with synthetic noise. We corrupt the data with Gaussian noise with standard deviation $\sigma\in\{1, 1.5, 2\}$. Models with VCReg outperform their non-regularized counterparts in this setting.
  • Figure 4: Impact of VCReg amidst noisy data: This figure shows the top-1 accuracy of VideoMAE-B and VideoMAEv2-B when fine-tuned for action recognition using HMDB51 with synthetic noise. We corrupt the data with Gaussian noise with standard deviation $\sigma\in\{1, 1.5, 2\}$. Models with VCReg outperform their non-regularized counterparts in this setting.
  • Figure 5: The effect of conventional regularization methods and VCReg on a simple two-moon classification task. The decision boundaries shown are averaged over 10 runs in which the data points and the model initialization are sampled randomly. Only the data points of one seed are plotted for visual clarity. Conventional deep learning regularization methods do not appear to help with learning a curved decision boundary.
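
The corruption described in the captions of Figures 3 and 4 (additive Gaussian noise with $\sigma\in\{1, 1.5, 2\}$) can be sketched as follows. This is only an illustration under stated assumptions: the helper name `add_gaussian_noise` is hypothetical, a flat list of floats stands in for a video clip, and details such as input normalization or clipping after corruption are not specified in this section.

```python
import random


def add_gaussian_noise(values, sigma, seed=None):
    """Return a copy of `values` with i.i.d. N(0, sigma^2) noise added to each entry."""
    rng = random.Random(seed)  # local generator so the corruption is reproducible
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical stand-in for input values; one corrupted copy per noise level.
clean = [0.0] * 5
noisy_by_sigma = {s: add_gaussian_noise(clean, s, seed=0) for s in (1.0, 1.5, 2.0)}
```

Fixing the seed keeps the corrupted inputs identical across models, so that regularized and non-regularized models can be compared on the same noisy data.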