Self-Distillation of Hidden Layers for Self-Supervised Representation Learning

Scott C. Lowe, Anthony Fuller, Sageev Oore, Evan Shelhamer, Graham W. Taylor

Abstract

The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.
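
The multi-layer objective described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the dictionary layout keyed by layer index, the mean-squared-error distance, and the equal weighting of layers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_layer_distillation_loss(
    predictions: Dict[int, List[Vector]],
    teacher_hidden: Dict[int, List[Vector]],
    masked_positions: Sequence[int],
) -> float:
    """Average prediction error over several teacher layers (hypothetical).

    predictions[layer][i] is the predictor's output for the token at
    masked_positions[i]; teacher_hidden[layer][p] is the teacher's hidden
    representation of token p at that layer (the teacher sees all tokens).
    """
    layer_losses = []
    for layer, preds in predictions.items():
        # Select the teacher's targets for the tokens hidden from the student.
        targets = [teacher_hidden[layer][p] for p in masked_positions]
        token_losses = [mse(p, t) for p, t in zip(preds, targets)]
        layer_losses.append(sum(token_losses) / len(token_losses))
    # Assumption: every target layer contributes equally to the total loss.
    return sum(layer_losses) / len(layer_losses)


# Toy example: 3 tokens, 2-dimensional embeddings, target layers 2 and 4.
teacher_hidden = {
    2: [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    4: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
}
masked_positions = [1, 2]
predictions = {
    2: [[1.0, 1.0], [2.0, 2.0]],  # matches the layer-2 targets exactly
    4: [[0.0, 1.0], [1.0, 3.0]],  # second token is off in one coordinate
}
loss = multi_layer_distillation_loss(predictions, teacher_hidden, masked_positions)
```

In this toy case the layer-2 predictions are exact and only one layer-4 token is wrong, so the total is driven entirely by that single error. A real implementation would operate on batched tensors and might weight or normalise the layers differently.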

Paper Structure (59 sections, 12 figures, 27 tables)

Figures (12)

  • Figure 1: Multi-layer self-distillation with Bootleg. The teacher-encoder (blue), student-encoder (green), and predictor (orange) are ViTs, made of repeated transformer blocks. A schematic of a single transformer block is overlaid (bottom right). The teacher-encoder is an EMA of the student-encoder, and processes the full image. The student-encoder sees only a subset of the image tokens and must embed them for use by the predictor. The predictor processes these embeddings to predict representations at multiple layers within the teacher-encoder.
  • Figure 2: Bridging I-JEPA and MAE with targets across hidden layers. Left: We train ViT-S with I-JEPA, except for the target which we change to be a hidden layer ($x$-axis) of the teacher-encoder instead of its final output (blue curve) or in addition to the final output (red). We plot the frozen attentive probe top-1 accuracy on IN-1k ($y$-axis) for the respective encoders. Middle: Similar, but for ViT-B to verify at larger scale. Right: We train ViT-S using MAE, except we add EMA for self-distillation and change the target to be a hidden layer of the EMA model instead of the input pixels (blue) or in addition to the image pixels (green). We compare to a single target with our masking strategy (cyan). In all cases, self-distillation of a hidden target is able to improve on predicting only the input or the output.
  • Figure 3: Accompanies Fig. 2 of the main text. Bridging I-JEPA and MAE with targets across hidden layers. Left: As in Fig. 2 of the main text, we train ViT-S using MAE, except we add EMA for self-distillation and change the target to be a hidden layer of the EMA model instead of the input pixels (blue) or in addition to the image pixels (green). We compare to a single target with our masking strategy (cyan), and to a single target but using the L1 loss (yellow). Using the L1 loss is insufficient to prevent training instabilities when using deeper layers as targets. Right: Similar, but for ViT-B with MAE config.
  • Figure 4: Sample masks generated by I-JEPA and Bootleg's masking strategies. Visible tokens are shown in blue. Prediction masks are shown in green/yellow/orange/red with one colour per mask rectangle. These prediction masks can overlap, leading to brighter shades of orange/yellow. Unused tokens (seen by the teacher-encoder, but not by either the student-encoder or predictor) are shown in black. For both I-JEPA and Bootleg, we show samples from worker-batches of 256 samples, generated with seeds 0,1,2,3. We show the first 4 examples from each batch (grouped horizontally). These samples illustrate how the visible tokens (blue) are always at the top of the image for I-JEPA's masking, but are better distributed for Bootleg's masks.
  • Figure 5: Rates at which each token position within the 14 × 14 grid is visible and presented to the student-encoder. Top row: I-JEPA masking. Bottom row: Bootleg masking. We show the change in visibility of the tokens as the per-GPU batch size is increased from 4 (left) to 1024 (right). Red marks on the colorbars indicate the lowest and highest rates at which any token is visible for a given plot. Note that token visibility is more consistent across space (a flatter distribution) for Bootleg masks. The I-JEPA masks omit the bottom row and right column entirely, rarely present the centre of the image, and their last-in-first-out truncation leads to the bottom four rows being seldom presented when the per-GPU batch size is larger.
  • ...and 7 more figures
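
The Figure 1 caption states that the teacher-encoder is an EMA of the student-encoder. A minimal sketch of such an update follows, assuming parameters are stored as plain lists of floats keyed by name. The function name `ema_update`, the parameter layout, and the momentum value are hypothetical and chosen only for illustration; the paper's actual schedule is not given in this excerpt.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, momentum: float) -> Params:
    """Return new teacher parameters as an exponential moving average.

    Each teacher weight moves a fraction (1 - momentum) of the way towards
    the corresponding student weight. No gradients flow into the teacher.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    updated: Params = {}
    for name, t_values in teacher.items():
        s_values = student[name]
        updated[name] = [
            momentum * t + (1.0 - momentum) * s
            for t, s in zip(t_values, s_values)
        ]
    return updated


# Toy example with a single two-element weight vector.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
new_teacher = ema_update(teacher, student, momentum=0.9)  # hypothetical momentum
```

With a momentum close to 1 the teacher changes slowly, which is the usual motivation for using an EMA copy as the source of prediction targets.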