Improving Pre-trained Self-Supervised Embeddings Through Effective Entropy Maximization

Deep Chakraborty, Yann LeCun, Tim G. J. Rudner, Erik Learned-Miller

TL;DR

This work tackles the challenge of squeezing additional performance from already-pretrained self-supervised embeddings by maximizing entropy through a practically estimable, low-dimensional criterion. The proposed E2MC objective adds an entropy term based on 1D marginals and a covariance penalty to standard SSL losses, with embeddings mapped to compact spaces via sigmoid or Gaussian-based transforms to ensure meaningful entropy estimates. Empirical results on ImageNet and transfer datasets show that a handful of continued pre-training epochs with E2MC yields consistent, sometimes substantial, improvements across VICReg, SwAV, and, to a lesser extent, SimSiam, while ablations highlight the necessity of both entropy and covariance components. The approach is computationally efficient, does not require high-dimensional joint-entropy estimation, and offers a practical path to enhance downstream performance in resource-constrained settings, with potential applicability to larger transformer-based models in the future.

Abstract

A number of different architectures and loss functions have been applied to the problem of self-supervised learning (SSL), with the goal of developing embeddings that provide the best possible pre-training for as-yet-unknown, lightly supervised downstream tasks. One of these SSL criteria is to maximize the entropy of a set of embeddings in some compact space. But the goal of maximizing the embedding entropy often depends -- whether explicitly or implicitly -- upon high-dimensional entropy estimates, which typically perform poorly in more than a few dimensions. In this paper, we motivate an effective entropy maximization criterion (E2MC), defined in terms of easy-to-estimate, low-dimensional constraints. We demonstrate that using it to continue training an already-trained SSL model for only a handful of epochs leads to a consistent and, in some cases, significant improvement in downstream performance. We perform careful ablation studies to show that the improved performance is due to the proposed add-on criterion. We also show that continued pre-training with alternative criteria does not lead to notable improvements and, in some cases, even degrades performance.
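To make the shape of such a criterion concrete, here is a minimal NumPy sketch of a regularizer built from the two ingredients named above: per-dimension (1D) entropy estimates computed after a sigmoid maps the embeddings into a compact space, plus a penalty on off-diagonal covariance. This is an illustration under stated assumptions, not the paper's implementation: the m-spacing entropy estimator, the normalization of the covariance penalty, the choice to compute the covariance on the transformed embeddings, and the names `marginal_entropy_m_spacing`, `covariance_penalty`, `e2mc_style_regularizer`, `entropy_weight`, and `cov_weight` are all hypothetical choices made for this sketch.

```python
import numpy as np


def sigmoid(z):
    """Map real-valued embeddings into the compact interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def marginal_entropy_m_spacing(x, m=None):
    """m-spacing estimate of the differential entropy of 1-D samples.

    Assumption: this estimator is chosen only for illustration; the
    section above does not say which 1-D estimator the paper uses.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if m is None:
        m = max(1, int(round(np.sqrt(n))))
    # Gaps between order statistics that are m positions apart.
    spacings = np.maximum(x[m:] - x[:-m], 1e-12)
    return float(np.mean(np.log((n + 1) / m * spacings)))


def covariance_penalty(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension."""
    z = z - z.mean(axis=0, keepdims=True)
    cov = z.T @ z / (z.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float((off_diag ** 2).sum() / z.shape[1])


def e2mc_style_regularizer(embeddings, entropy_weight=1.0, cov_weight=1.0):
    """Hypothetical add-on term: lower is better.

    Rewards high 1-D marginal entropy and penalizes correlated dimensions.
    """
    u = sigmoid(embeddings)  # compact space, shape (batch, dim)
    mean_entropy = np.mean(
        [marginal_entropy_m_spacing(u[:, j]) for j in range(u.shape[1])]
    )
    return -entropy_weight * mean_entropy + cov_weight * covariance_penalty(u)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = rng.uniform(1e-6, 1 - 1e-6, size=(2048, 4))
    spread = np.log(u) - np.log1p(-u)  # logit, so sigmoid(spread) is uniform
    collapsed = 0.01 * rng.standard_normal((2048, 4))
    print("spread:   ", e2mc_style_regularizer(spread))
    print("collapsed:", e2mc_style_regularizer(collapsed))
```

Only 1D entropy estimates and a d-by-d covariance matrix are needed, so the sketch never estimates a joint entropy in d dimensions. In a training setting this term would be added to the base SSL loss and would need a differentiable implementation; the NumPy version is meant only to show what is being measured.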

Paper Structure

This paper contains 38 sections, 15 equations, 15 figures, 5 tables, 1 algorithm.

Figures (15)

  • Figure 1: An overview of our approach to continued pre-training with E2MC, which has three main stages: SSL model selection, training with the augmented criterion, and evaluating the updated representations on downstream tasks.
  • Figure 2: (a) A 2-d uniform distribution. (b) An "X" distribution. Both (analytic) distributions have uniform (max-entropy) marginals and decorrelated components, and both minimize our loss function. (c) An example 2-d marginal distribution over a random pair of dimensions from VICReg (after transformation to compact space). (d) Our embeddings over the same pair of dimensions, where empirical results show, to our surprise, distributions with uniform 2-d marginals even though this is not explicitly enforced by our loss. The colors denote the relative positions of actual points in embedding space before (c) and after (d) the application of our maximum-entropy criterion, demonstrating how our method spreads them out.
  • Figure 3:
  • Figure 4:
  • Figure 6: Top-1 accuracy of a linear classifier trained using 1% of ImageNet labels at different epochs of continued pre-training. Continued pre-training with our criterion (E2MC) outperforms the other baselines, and performance beyond the reported ten epochs either improves marginally or degrades, depending on the method.
  • ...and 10 more figures
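
The claim in the Figure 2 caption, that a 2-d uniform distribution and an "X" distribution both have uniform marginals and decorrelated components, can be checked numerically. The sketch below is an illustration only: the sampling scheme for the "X" (points on the two diagonals of the unit square, each chosen with probability 1/2) is an assumed construction, and the helper names are hypothetical.

```python
import numpy as np


def sample_uniform_square(n, rng):
    """Independent uniform coordinates on the unit square."""
    return rng.uniform(0.0, 1.0, size=(n, 2))


def sample_x_distribution(n, rng):
    """Assumed "X" construction: y = x or y = 1 - x with equal probability."""
    x = rng.uniform(0.0, 1.0, size=n)
    on_anti_diagonal = rng.random(n) < 0.5
    y = np.where(on_anti_diagonal, 1.0 - x, x)
    return np.stack([x, y], axis=1)


def summarize(points, bins=10):
    """Return (correlation, worst marginal deviation, joint cell occupancy)."""
    corr = float(np.corrcoef(points, rowvar=False)[0, 1])
    # Deviation of each 1-D histogram from the flat value 1 / bins.
    worst = 0.0
    for j in range(2):
        counts, _ = np.histogram(points[:, j], bins=bins, range=(0.0, 1.0))
        worst = max(worst, float(np.abs(counts / len(points) - 1.0 / bins).max()))
    # Fraction of 2-D histogram cells that contain at least one point.
    joint, _, _ = np.histogram2d(
        points[:, 0], points[:, 1], bins=bins, range=[[0.0, 1.0], [0.0, 1.0]]
    )
    occupancy = float((joint > 0).mean())
    return corr, worst, occupancy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for name, sampler in [("uniform", sample_uniform_square),
                          ("X", sample_x_distribution)]:
        corr, worst, occupancy = summarize(sampler(20000, rng))
        print(f"{name}: corr={corr:.3f}, marginal dev={worst:.3f}, "
              f"occupied cells={occupancy:.2f}")
```

Both samplers produce near-zero correlation and flat 1-d histograms, yet the "X" occupies only a small fraction of the 2-d cells. That is why low-dimensional constraints alone do not force a uniform joint distribution, and why the uniform 2-d marginals in panel (d) are described as a surprise.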