LayerLock: Non-collapsing Representation Learning with Progressive Freezing

Goker Erdogan, Nikhil Parthasarathy, Catalin Ionescu, Drew A. Hudson, Alexander Lerchner, Andrew Zisserman, Mehdi S. M. Sajjadi, Joao Carreira

TL;DR

LayerLock tackles self-supervised video representation learning by exploiting the observed ordered convergence of ViT layers during MAE training: shallower layers converge earlier, enabling progressive freezing and a shift from pixel to latent targets. The method sequentially freezes layers and transitions the prediction target from pixels to progressively deeper layer activations, using a patch-wise decoder and a 3D RoPE positional scheme; it is applicable to both pixel-based MAE and latent-prediction frameworks (e.g., V-JEPA). The authors show that progressive freezing reduces compute and memory and avoids representation collapse that often accompanies latent training, while achieving strong performance on downstream tasks such as action classification and depth estimation. They validate LayerLock on large-scale models (up to 4B parameters), outperforming non-latent masked prediction baselines on the 4DS perception suite and demonstrating generality across pixel and latent paradigms.

Abstract

We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from "representation collapse". We apply our proposed approach, LayerLock, to large models of up to 4B parameters, with results surpassing those of non-latent masked prediction on the 4DS perception suite.
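The explicit freezing schedule and the matching change of prediction target can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `(start_step, num_frozen_layers)` schedule format, and the step values are all hypothetical. The only behaviour taken from the paper is that with no frozen layers the target is pixels, and with the first k layers frozen the target is the output of layer k.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical schedule format: (start_step, num_frozen_layers), sorted by step.
Schedule = List[Tuple[int, int]]


def frozen_depth(step: int, schedule: Schedule) -> int:
    """Return how many leading layers are frozen at training step `step`."""
    starts = [start for start, _ in schedule]
    # Index of the last schedule entry whose start_step is <= step.
    idx = bisect_right(starts, step) - 1
    return schedule[idx][1] if idx >= 0 else 0


def prediction_target(step: int, schedule: Schedule) -> str:
    """Name the prediction target at `step`: pixels, or the output h_k of layer k."""
    k = frozen_depth(step, schedule)
    return "pixels" if k == 0 else f"h_{k}"


if __name__ == "__main__":
    # Illustrative values only; the paper's actual schedule is not given here.
    example_schedule = [(1000, 1), (2000, 2), (4000, 4)]
    for step in (0, 1000, 2500, 10000):
        print(step, frozen_depth(step, example_schedule),
              prediction_target(step, example_schedule))
```

In a real training loop, the returned depth would decide which encoder layers stop receiving gradient updates and which activations the decoder is asked to predict.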

Paper Structure

This paper contains 23 sections, 3 figures, 12 tables.

Figures (3)

  • Figure 1: In video masked auto-encoding, network layers converge during training in order of their depth. We measure the final loss ($L_{base}$) of a baseline (unfrozen) model after training for 14000 steps. Each point similarly shows the corresponding final training loss when the network is frozen up to layer $L$ at step $T$. Shallow layers "converge" faster than deeper layers: they can be frozen at earlier steps while the final loss still stays very close to $L_{base}$. In other words, layer convergence order is correlated with layer depth, motivating our proposed LayerLock progressive freezing approach to learning.
  • Figure 2: Proposed learning paradigm. Learning goes through multiple stages as the model switches between prediction targets. Left: no frozen layers, predicting pixels $x$. Middle: freezing the first layer and predicting the output of the first layer, $h_1$. Right: freezing the first two layers and predicting the output of the second layer, $h_2$. $z$ are latents added for decoding. For illustration purposes, only a single Transformer block is shown after $v_c$ and $z$ are concatenated. Freezing continues progressively, according to a pre-determined schedule.
  • Figure 3: Progressive freezing of layers saves on total training cost and peak memory utilization without loss in performance. (Right) Cumulative training cost (PetaFLOPs) and peak memory usage at each step (GiB) are shown for the baseline (unfrozen) and our method. Freezing events are indicated by gray vertical lines. (Left) Performance of baseline unfrozen MAE (dotted) vs. progressive freezing MAE on SSV2 (accuracy) and ScanNet depth prediction (relative error).
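
To give a rough intuition for why freezing reduces cumulative training cost, as Figure 3 reports, here is a toy cost model. It is a sketch under stated assumptions, not the paper's measurement: every layer costs one unit in the forward pass, the backward pass through a trainable layer costs `backward_ratio` units, frozen leading layers skip the backward pass entirely, and the decoder and memory are ignored. The function name, the schedule format, and all numbers are hypothetical.

```python
from typing import List, Tuple


def relative_training_cost(num_layers: int,
                           total_steps: int,
                           schedule: List[Tuple[int, int]],
                           backward_ratio: float = 2.0) -> float:
    """Training cost relative to an unfrozen baseline under a toy cost model.

    `schedule` holds hypothetical (start_step, num_frozen_layers) pairs.
    """
    # Training starts with nothing frozen; later events override earlier ones.
    events = [(0, 0)] + sorted(schedule)
    cost = 0.0
    for i, (start, frozen) in enumerate(events):
        end = events[i + 1][0] if i + 1 < len(events) else total_steps
        steps = max(0, min(end, total_steps) - start)
        trainable = num_layers - frozen
        # All layers run forward; only trainable layers run backward.
        cost += steps * (num_layers + backward_ratio * trainable)
    baseline = total_steps * num_layers * (1.0 + backward_ratio)
    return cost / baseline


if __name__ == "__main__":
    # Illustrative: freeze half of a 4-layer model halfway through training.
    print(relative_training_cost(num_layers=4, total_steps=100, schedule=[(50, 2)]))
```

Under these assumptions the saving grows with how early and how deep the schedule freezes; actual savings depend on the real architecture, decoder, and hardware.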