LayerLock: Non-collapsing Representation Learning with Progressive Freezing
Goker Erdogan, Nikhil Parthasarathy, Catalin Ionescu, Drew A. Hudson, Alexander Lerchner, Andrew Zisserman, Mehdi S. M. Sajjadi, Joao Carreira
TL;DR
LayerLock addresses self-supervised video representation learning by exploiting an observed ordering in how ViT layers converge during masked-autoencoding (MAE) training: shallower layers converge earlier than deeper ones, which enables progressive freezing and a shift from pixel to latent targets. The method freezes layers sequentially and moves the prediction target from pixels to the activations of progressively deeper layers, using a patch-wise decoder and a 3D RoPE positional scheme. It applies to both pixel-based MAE and latent-prediction frameworks (e.g., V-JEPA). The authors show that progressive freezing reduces compute and memory and avoids the representation collapse that often accompanies latent training, while achieving strong performance on downstream tasks such as action classification and depth estimation. They validate LayerLock on large-scale models (up to 4B parameters), where it outperforms non-latent masked prediction baselines on the 4DS perception suite and generalizes across the pixel and latent paradigms.
Abstract
We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from "representation collapse". We apply our proposed approach, LayerLock, to large models of up to 4B parameters and obtain results surpassing those of non-latent masked prediction on the 4DS perception suite.
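To make the idea of an explicit freezing schedule concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical schedule format (a list of stages, each giving a start step and the number of shallowest layers to freeze) and assumes that once k layers are frozen, the prediction target becomes the output of the deepest frozen layer. The names FreezeStage, stage_at, training_plan and the step and layer counts in EXAMPLE_SCHEDULE are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreezeStage:
    """One stage of a hypothetical progressive-freezing schedule."""
    start_step: int     # first training step at which this stage applies
    frozen_layers: int  # number of shallowest layers frozen (0 = none)


def stage_at(schedule, step):
    """Return the stage with the largest start_step that is <= step."""
    eligible = [s for s in schedule if s.start_step <= step]
    if not eligible:
        raise ValueError(f"no stage covers step {step}")
    return max(eligible, key=lambda s: s.start_step)


def training_plan(schedule, step, num_layers):
    """Return (trainable layer indices, prediction target) for a step.

    Assumption: with k layers frozen, layers 0..k-1 receive no updates and
    the target is the output of layer k-1; with k == 0 the target is pixels.
    """
    k = stage_at(schedule, step).frozen_layers
    if not 0 <= k < num_layers:
        raise ValueError("frozen_layers must leave at least one trainable layer")
    trainable = list(range(k, num_layers))
    target = "pixels" if k == 0 else f"layer_{k - 1}_output"
    return trainable, target


# Illustrative values only: a 12-layer encoder that freezes 3 more layers
# every 1000 steps after an initial pixel-prediction phase.
EXAMPLE_SCHEDULE = [
    FreezeStage(start_step=0, frozen_layers=0),
    FreezeStage(start_step=1000, frozen_layers=3),
    FreezeStage(start_step=2000, frozen_layers=6),
    FreezeStage(start_step=3000, frozen_layers=9),
]

if __name__ == "__main__":
    for step in (0, 1500, 2500, 5000):
        trainable, target = training_plan(EXAMPLE_SCHEDULE, step, num_layers=12)
        print(step, target, trainable)
```

In a real training loop, the list of trainable layers would decide which parameters receive gradient updates, and the target string would select what the decoder is asked to predict. Because frozen layers need no backward pass, a schedule of this shape is also where the compute and memory savings described above would come from.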
