Scaling Supervised Local Learning with Augmented Auxiliary Networks

Chenxiang Ma, Jibin Wu, Chenyang Si, Kay Chen Tan

TL;DR

This work tackles the scalability gap of supervised local learning by building each hidden layer's auxiliary network from a uniformly selected subset of its downstream layers, and by introducing a pyramidal depth schedule that linearly reduces auxiliary depth as layers approach the output. The resulting AugLocal framework strengthens the coupling between local layers and downstream processing, enabling near-BP accuracy on large networks while substantially reducing GPU memory usage. Key contributions include a construction rule for augmented auxiliary networks, a depth-scheduling strategy to control compute, and extensive empirical validation across CIFAR-10, SVHN, STL-10, and ImageNet with various ConvNet backbones, supported by representation-similarity and linear-probing analyses. The approach offers a practical path to scalable, memory-efficient training of deep networks on resource-constrained platforms, with potential for parallelized implementations and for combination with more advanced local losses.

Abstract

Deep neural networks are typically trained using global error signals that backpropagate (BP) end-to-end, which is not only biologically implausible but also suffers from the update locking problem and requires huge memory consumption. Local learning, which updates each layer independently with a gradient-isolated auxiliary network, offers a promising alternative to address the above problems. However, existing local learning methods suffer from a large accuracy gap relative to their BP counterparts, particularly for large-scale networks. This is due to the weak coupling between local layers and their subsequent network layers, as there is no gradient communication across layers. To tackle this issue, we put forward an augmented local learning method, dubbed AugLocal. AugLocal constructs each hidden layer's auxiliary network by uniformly selecting a small subset of layers from its subsequent network layers to enhance their synergy. We also propose to linearly reduce the depth of auxiliary networks as the hidden layer goes deeper, ensuring sufficient network capacity while reducing the computational cost of auxiliary networks. Our extensive experiments on four image classification datasets (i.e., CIFAR-10, SVHN, STL-10, and ImageNet) demonstrate that AugLocal can effectively scale up to tens of local layers with accuracy comparable to BP-trained networks while reducing GPU memory usage by around 40%. The proposed AugLocal method, therefore, opens up a myriad of opportunities for training high-performance deep neural networks on resource-constrained platforms. Code is available at https://github.com/ChenxiangMA/AugLocal.
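
The two construction ideas in the abstract (uniform selection of downstream layers, and auxiliary depth that shrinks linearly with the hidden layer's position) can be illustrated with a small sketch. This is not the authors' exact formula: the function names, the 1-based layer numbering with layer L as the final classifier, the integer rounding, and the `d_max`/`d_min` parameters are all assumptions made for illustration. The released code at the URL above is the reference for the actual rule.

```python
def pyramidal_depths(num_layers, d_max, d_min=1):
    """Assumed schedule: auxiliary depth falls linearly from d_max to d_min
    across hidden layers 1..num_layers-1 (layer num_layers is the classifier)."""
    hidden = num_layers - 1
    steps = max(hidden - 1, 1)  # avoid division by zero for tiny networks
    depths = []
    for layer in range(1, hidden + 1):
        # Integer linear interpolation between d_max and d_min.
        d = d_max - ((layer - 1) * (d_max - d_min)) // steps
        # A layer cannot borrow more layers than exist downstream of it.
        depths.append(max(1, min(d, num_layers - layer)))
    return depths


def select_downstream(layer, num_layers, depth):
    """Assumed rule: pick `depth` evenly spaced indices from
    layer+1..num_layers, always ending at the final layer."""
    remaining = num_layers - layer
    depth = min(depth, remaining)
    # Rounded integer spacing; indices are strictly increasing because
    # the spacing remaining / depth is at least 1.
    return [layer + (i * remaining + depth // 2) // depth
            for i in range(1, depth + 1)]


if __name__ == "__main__":
    L = 16
    for layer, depth in enumerate(pyramidal_depths(L, d_max=6, d_min=2), start=1):
        print(layer, depth, select_downstream(layer, L, depth))
```

Under these assumptions, `select_downstream(1, 16, 3)` returns `[6, 11, 16]`: the first hidden layer's auxiliary network mirrors three evenly spaced downstream layers, ending at the classifier, while layers near the output receive progressively shallower auxiliary networks.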

Paper Structure

This paper contains 38 sections, 3 equations, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: Comparison of supervised local learning rules and BP on the CIFAR-10 dataset, using a ResNet-32 architecture with 16 local layers.
  • Figure 2: Comparison of (a) end-to-end backpropagation (BP), (b) supervised local learning, and (c) our proposed AugLocal method. Unlike BP, supervised local learning trains each hidden layer with a gradient-isolated auxiliary network. AugLocal constructs the auxiliary networks by uniformly selecting a given number of layers from each hidden layer's subsequent layers. Additionally, the depth of auxiliary networks linearly decreases as the hidden layer approaches the final classifier. Black and red arrows represent forward and gradient propagation during training.
  • Figure 3: Comparison of layer-wise representation similarity. We utilize centered kernel alignment (CKA; Kornblith et al., 2019) to measure the layer-wise similarity of representations between BP and other local learning rules. To provide a fair baseline for BP, we measure the similarity between two networks trained with different random seeds. The average CKA similarity scores for different learning rules are provided in the legend.
  • Figure 4: Comparison of layer-wise linear separability across different learning rules.
  • Figure 5: Influence of pyramidal depth on accuracy and computational efficiency. The FLOPs reduction is computed as the relative difference in FLOPs between training with and without pyramidal depth. Refer to the FLOPs table (Tab:flops) for specific FLOPs values. Results are obtained using ResNet-110 on CIFAR-10.
  • ...and 1 more figures
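
The representation-similarity measure named in the Figure 3 caption can be sketched as follows. This is a dependency-free illustration of the linear variant of CKA only; whether the paper uses linear or kernel CKA is not stated in this summary, and the function names and the row-major list-of-lists layout (one row per example, one column per feature) are assumptions.

```python
import math


def center_columns(X):
    """Subtract each feature's mean so every column sums to zero."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y for row-major matrices."""
    total = 0.0
    for xcol in zip(*X):
        for ycol in zip(*Y):
            dot = sum(a * b for a, b in zip(xcol, ycol))
            total += dot * dot
    return total


def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered
    features. Assumes neither input is constant (nonzero denominator)."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator


if __name__ == "__main__":
    acts_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]  # toy activations, 3 examples
    acts_b = [[2.0, 1.0], [0.5, 3.0], [4.0, 4.0]]
    print(linear_cka(acts_a, acts_a), linear_cka(acts_a, acts_b))
```

Linear CKA is invariant to isotropic scaling and to orthogonal transformations of the features, which is why it suits comparing the same layer across networks trained with different rules or seeds. In practice the inputs would be the activations of one layer from each network on the same batch of examples.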