Blockwise Self-Supervised Learning at Scale

Shoaib Ahmed Siddiqui, David Krueger, Yann LeCun, Stéphane Deny

TL;DR

This work investigates blockwise self-supervised learning as a scalable alternative to full backpropagation on large-scale data. By partitioning a ResNet-50 into 4 blocks and applying a self-supervised objective (Barlow Twins) to each block, with a stop-gradient between blocks, the authors show that simultaneous blockwise training reaches a top-1 linear-probe accuracy on ImageNet of $70.48\%$, only $1.1$ percentage points below the end-to-end trained model at $71.57\%$, with a small boost from noise injection ($\sigma\approx0.25$). The study further shows that the approach generalizes to other SSL losses (SimCLR, VICReg) and that pooling design, specifically Conv-based Expansion followed by global pooling, significantly influences final performance. Robustness on ImageNet-C degrades under blockwise training, suggesting a trade-off between performance and robustness. The work also discusses the limitations of fully local learning, the importance of early-block training, and potential future directions, including more specialized local learning rules and implications for neuromorphic hardware.

Abstract

Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.
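
The abstract names Barlow Twins' loss as the objective applied at each block. As a rough illustration, the sketch below computes that loss in its commonly published form: the cross-correlation matrix $C$ between the batch-standardized embeddings of two augmented views is pushed toward the identity, i.e. $\sum_i (1 - C_{ii})^2 + \lambda \sum_{i \neq j} C_{ij}^2$. The function names, the plain-list data layout, and the default `lam` are assumptions made for this example; nothing here is taken from the authors' code.

```python
import math


def standardize(batch):
    """Scale each embedding dimension to zero mean and unit std over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for i in range(n):
            # Small epsilon guards against constant dimensions.
            out[i][j] = (col[i] - mean) / (std + 1e-12)
    return out


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins objective for two views' embeddings, each of shape [N][D].

    `lam` weights the off-diagonal (redundancy-reduction) term; the default
    used here is an assumption for illustration.
    """
    n, d = len(z_a), len(z_a[0])
    a, b = standardize(z_a), standardize(z_b)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the cross-correlation matrix between the views.
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2   # invariance term
            else:
                loss += lam * c_ij ** 2     # redundancy-reduction term
    return loss


# Toy usage: identical views whose two dimensions are uncorrelated.
views = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
toy_loss = barlow_twins_loss(views, views)
```

In a blockwise setting, one such loss would be computed on the pooled and projected output of each block, so that each block receives its own training signal.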

Paper Structure

This paper contains 35 sections, 2 equations, 16 figures, and 1 algorithm.

Figures (16)

  • Figure 1: We rank blockwise/local learning methods of the literature according to their biological plausibility (from left to right), and indicate their demonstrated ability to scale to large-scale datasets (e.g., ImageNet) by the intensity of the blue rectangle below each model family. Our method is situated at a unique trade-off between biological plausibility and performance on large-scale datasets, by being on par in performance with intertwined blockwise training [xiong2020loco] and supervised broadcasted learning [nokland2016direct; clark_credit_2021; belilovsky19a] while being more biologically plausible than these alternatives. Methods represented in magenta used a similar methodology to ours [halvagal_combination_2022; lowe2019putting], but have only been demonstrated to work on small datasets.
  • Figure 2: Overview of our blockwise local learning approach. Each of the 4 blocks of layers of a ResNet-50 is trained independently using a self-supervised learning rule and a local backpropagation path. We apply this procedure in two main settings: (left) sequential training, where each block, starting from the first block, is independently trained and frozen before the next block is trained; (right) simultaneous blockwise training, where all the blocks are trained simultaneously using a 'stop-grad' operation to limit the backpropagation paths to a single block. The yellow gear symbol refers to the combination of the pooling layer, projection head, and loss function used in self-supervised learning.
  • Figure 3: Overview of the pooling strategies employed in this work. The first (a) is Global Spatial Pooling (GSP), which computes a global average activation over the entire feature volume. The second (b) is Local Spatial Pooling (LSP), where we divide the feature volume into small spatial bins, compute the average within each bin, and concatenate these outputs into a final feature vector of size 2048. The third (c) is a (1x1) Conv-based Expansion (CbE), followed by global spatial pooling. This last pooling strategy provides the best performance for our method.
  • Figure 4: Overview of our key results. (left) Top-1 accuracy of a linear probe trained on top of our pretrained network on ImageNet for different pretraining procedures. Our best blockwise-trained model [B] is almost on par with full backpropagation [A] (only 1.1% performance gap). (right) Accuracy on ImageNet as a function of the depth of the network. Since our blockwise models are trained for each block in isolation, we can include a different number of blocks when computing the linear probe accuracy. We visualize the accuracy as we include more blocks into the network.
  • Figure 5: Impact of the SSL objective function used. Our approach (simultaneous blockwise training with conv-based expansion pooling of the block outputs) almost matches end-to-end training performance for all SSL methods tested.
  • ...and 11 more figures
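
The three pooling strategies described in the Figure 3 caption can be sketched in a few lines. This is a minimal, dependency-free illustration on a tiny feature volume stored as nested lists `[C][H][W]`; the function names, the `grid` parameter, and the toy weights are hypothetical, and the plain linear 1x1 convolution is an assumption, since the caption does not say whether normalization or a nonlinearity follows it. In the paper's setting the concatenated or expanded vector has size 2048, whereas the toy example produces much smaller vectors.

```python
def global_spatial_pooling(fmap):
    """GSP: average each channel over all spatial positions -> C values."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def local_spatial_pooling(fmap, grid):
    """LSP: average each channel within grid x grid spatial bins, then
    concatenate the bin averages -> C * grid * grid values.

    Assumes H and W are divisible by `grid`.
    """
    h, w = len(fmap[0]), len(fmap[0][0])
    bh, bw = h // grid, w // grid
    out = []
    for ch in fmap:
        for gi in range(grid):
            for gj in range(grid):
                vals = [ch[r][c]
                        for r in range(gi * bh, (gi + 1) * bh)
                        for c in range(gj * bw, (gj + 1) * bw)]
                out.append(sum(vals) / len(vals))
    return out


def conv_expansion_pooling(fmap, weight):
    """CbE: 1x1 convolution (weight is [C_out][C_in]) applied at every
    spatial position, followed by global spatial pooling -> C_out values."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    expanded = [[[sum(w_row[k] * fmap[k][r][c] for k in range(c_in))
                  for c in range(w)]
                 for r in range(h)]
                for w_row in weight]
    return global_spatial_pooling(expanded)


# Toy feature volume: 2 channels, 2x2 spatial extent.
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 8.0]]]
gsp = global_spatial_pooling(fmap)
lsp = local_spatial_pooling(fmap, grid=2)
cbe = conv_expansion_pooling(fmap, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

GSP keeps the feature size equal to the channel count, while LSP and CbE both enlarge it: LSP by retaining coarse spatial layout, CbE by learning a channel expansion before averaging.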