Blockwise Self-Supervised Learning at Scale
Shoaib Ahmed Siddiqui, David Krueger, Yann LeCun, Stéphane Deny
TL;DR
This work investigates blockwise self-supervised learning as a scalable alternative to full backpropagation on large-scale data. The authors partition a ResNet-50 into 4 blocks and train all blocks simultaneously, each with its own self-supervised objective (Barlow Twins) and with a stop-gradient between consecutive blocks. A linear probe on the resulting model reaches a top-1 ImageNet accuracy of $70.48\%$, only $1.1$ percentage points below the end-to-end trained model at $71.57\%$. Noise injection ($\sigma\approx0.25$) provides a small boost. The study further shows that the approach generalizes to other SSL losses (SimCLR, VICReg) and that the pooling design, specifically Conv-based Expansion with global pooling, significantly influences final performance. Robustness on ImageNet-C degrades under blockwise training, suggesting a trade-off between performance and robustness. The work also discusses the limitations of fully local learning, the importance of training the early blocks well, and potential future directions, including more specialized local learning rules and implications for neuromorphic hardware.
Abstract
Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure, which independently trains the 4 main blocks of layers of a ResNet-50 with the Barlow Twins loss function at each block, performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1 percentage points below the accuracy of an end-to-end pretrained network (71.57%). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.
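To make the per-block objective concrete, the following is a minimal NumPy sketch of the Barlow Twins loss in its commonly published form: the embeddings of two augmented views are standardized over the batch, their cross-correlation matrix is computed, and the loss pushes that matrix toward the identity. The function name `barlow_twins_loss`, the parameter `lambda_offdiag`, and its default value are illustrative assumptions and are not taken from this paper. In the blockwise setting described here, a loss of this kind would be evaluated on the pooled output of each of the 4 blocks, with gradients stopped between blocks. NumPy has no automatic differentiation, so the stop-gradient itself is not shown.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-9):
    """Barlow Twins loss for two batches of embeddings of shape (N, D).

    lambda_offdiag weights the redundancy-reduction (off-diagonal) term;
    the default here is an assumed value, not one reported in this paper.
    """
    n = z_a.shape[0]
    # Standardize every feature dimension over the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (D, D).
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    # Invariance term: diagonal entries should be close to 1.
    on_diag = np.sum((diag - 1.0) ** 2)
    # Redundancy-reduction term: off-diagonal entries should be close to 0.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return on_diag + lambda_offdiag * off_diag


# Toy usage with random stand-ins for the pooled features of one block.
rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
loss_same = barlow_twins_loss(z, z)                            # identical views
loss_indep = barlow_twins_loss(z, rng.normal(size=(256, 8)))   # unrelated views
```

Identical views give a loss near zero because the diagonal of the cross-correlation matrix is close to 1, while unrelated views are penalized roughly once per feature dimension.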
