Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning

Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes, Pascal Vincent

TL;DR

The paper tackles the paradoxical finding that removing the projection head after self-supervised training often yields superior downstream performance, highlighting a misalignment between pretraining objectives and downstream tasks. It formalizes Guillotine Regularization (GR) as evaluating downstream readouts on an intermediate trunk rather than the full last layers, and conducts extensive experiments across supervised and SSL setups to show that the optimal cut depends on training, data, and task. Key findings include that trunk and head readouts are not consistently predictive of each other, and that aligning the pretext and downstream tasks can reduce or even eliminate the need for a projector. These results have practical implications for SSL evaluation practices and point toward alignment-aware training and augmentation strategies to improve generalization.

Abstract

One unexpected technique that emerged in recent years consists of training a Deep Network (DN) with a Self-Supervised Learning (SSL) method, and using this network on downstream tasks but with its last few projector layers entirely removed. This trick of throwing away the projector is actually critical for SSL methods to display competitive performance on ImageNet, where more than 30 percentage points can be gained that way. This is a little vexing, as one would hope that the network layer at which invariance is explicitly enforced by the SSL criterion during training (the last projector layer) should be the one to use for best generalization performance downstream. But it seems not to be, and this study sheds some light on why. This trick, which we name Guillotine Regularization (GR), is in fact a generically applicable method that has been used to improve generalization performance in transfer learning scenarios. In this work, we identify the underlying reasons behind its success and show that the optimal layer to use might change significantly depending on the training setup, the data, or the downstream task. Lastly, we give some insights on how to reduce the need for a projector in SSL by aligning the pretext SSL task and the downstream task.
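
The evaluation protocol behind Guillotine Regularization (freeze the network, fit a linear probe on the activations at each depth, and keep the depth with the lowest validation error) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and not the authors' code: the helper names `forward_all`, `probe_error`, and `best_cut` are hypothetical, the network is a hand-built toy rather than a trained SSL model, and the weights are chosen so that the wide first layer keeps the input linearly recoverable while the narrow last layer discards most of it.

```python
import numpy as np

def forward_all(x, layers):
    """Return the activations after every layer of a frozen ReLU MLP."""
    acts = []
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)  # frozen weights, ReLU non-linearity
        acts.append(x)
    return acts

def probe_error(f_tr, y_tr, f_va, y_va, ridge=1e-3):
    """Fit a ridge-regularized linear probe and return its validation MSE."""
    add_bias = lambda f: np.hstack([f, np.ones((len(f), 1))])
    A, B = add_bias(f_tr), add_bias(f_va)
    w = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ y_tr)
    return float(np.mean((B @ w - y_va) ** 2))

def best_cut(layers, x_tr, y_tr, x_va, y_va):
    """Probe every depth; return the index of the best layer and all errors."""
    acts_tr = forward_all(x_tr, layers)
    acts_va = forward_all(x_va, layers)
    errors = [probe_error(a, y_tr, b, y_va) for a, b in zip(acts_tr, acts_va)]
    return int(np.argmin(errors)), errors

rng = np.random.default_rng(0)

# Toy frozen network (an assumption, not a trained model):
#   layer 0, "trunk": 8 -> 16 with weights [I, -I], so relu(x) - relu(-x) = x
#   layer 1, "head":  16 -> 2 with random weights, a narrow bottleneck
trunk = (np.hstack([np.eye(8), -np.eye(8)]), np.zeros(16))
head = (rng.normal(size=(16, 2)), np.zeros(2))
layers = [trunk, head]

# Downstream task: recover one latent coordinate of the input.
x_tr, x_va = rng.normal(size=(400, 8)), rng.normal(size=(200, 8))
y_tr, y_va = x_tr[:, 0], x_va[:, 0]

cut, errors = best_cut(layers, x_tr, y_tr, x_va, y_va)
print("validation MSE per layer:", errors)
print("best layer to read out from:", cut)
```

In this toy the probe on the trunk recovers the target almost exactly, while the probe on the final layer cannot, so the selected cut is before the head. With a real model the same loop would run over the backbone and projector activations, and the best depth would depend on the training setup, the data, and the downstream task.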

Paper Structure (18 sections, 3 equations, 17 figures, 2 tables)

Figures (17)

  • Figure 1:
  • Figure 2:
  • Figure 4: Training a linear regression to predict latent variables from pooled intermediate representations of a network trained with a self-supervised objective (SimCLR) or a supervised objective (predicting the 3D rotation of an object). The data consists of renderings of 3D objects from 3D Warehouse, where the floor, lighting, and object pose are controlled by latent variables; see samples on the right. The dimension of the intermediate representations increases throughout the layers and is kept constant in the head, if there is one. In the supervised setting, the lowest validation mean squared error for object rotation prediction is obtained with the linear probe at the last layer of the network. In contrast, the lowest error for other attributes, such as the Spot $\theta$ prediction, is obtained with linear probes located 3, 4, or 5 layers before the output of the network. In the self-supervised setting, we also see that the predictor is responsible for a lot of the invariance to augmentation, and that the information is most easily retrievable before it. These results highlight the need for Guillotine Regularization, i.e., removing the last layers of the neural network to generalize better on other tasks.
  • Figure 5: Supervised: The optimal layer to cut might change depending on the training optimization, the data, and the downstream task. The best accuracy for each curve is shown as a big square. For each experiment, we trained a headed supervised ResNet50 on ImageNet (with a 3-layer MLP as projection head). For a) and c) we trained this network on the full training set, whereas for b) we used a random subset of 250 classes. Then, we froze the model parameters and trained linear probes on the representations at different layers. a) We trained two models with different optimization pipelines: the first one, in blue, was trained with SGD using a cycling learning rate, along with momentum and weight decay. The second one, in gray, was trained with AdamW without additional regularization. This second model overfits the training set, which leads to similar validation performance across the backbone and projector. In contrast, the first one generalizes much better, but its performance changes significantly across layers. b) Validation accuracy given by linear probes at each layer on different random subsets of 250 ImageNet classes. The validation split in gray corresponds to the same subset of classes that was used for training, whereas Splits 1-6 correspond to different OOD random splits. In this instance, we see that the optimal layer to use is the first layer of the projector. c) Validation performance on different downstream tasks. We took the well-regularized model from a) and evaluated it across different downstream tasks. For some datasets, the optimal layer to use is the last one, while for others it is the second layer of the projector.
  • Figure 6: SimCLR: Linear probe accuracy on several downstream tasks. The optimal layer to cut is not the same for different downstream tasks.
  • ...and 12 more figures