MAN++: Scaling Momentum Auxiliary Network for Supervised Local Learning in Vision Tasks

Junhao Su, Feiyu Zhu, Hengyu Shi, Tianyang Han, Yurui Qiu, Junfeng Luo, Xiaoming Wei, Jialin Gao

Abstract

Deep learning typically relies on end-to-end backpropagation for training, a method that inherently suffers from issues such as update locking during parameter optimization, high GPU memory consumption, and a lack of biological plausibility. In contrast, supervised local learning seeks to mitigate these challenges by partitioning the network into multiple local blocks and designing independent auxiliary networks to update each block separately. However, because gradients are propagated solely within individual local blocks, performance degradation occurs, preventing supervised local learning from supplanting end-to-end backpropagation. To address these limitations and facilitate inter-block information flow, we propose the Momentum Auxiliary Network++ (MAN++). MAN++ introduces a dynamic interaction mechanism by employing the Exponential Moving Average (EMA) of parameters from adjacent blocks to enhance communication across the network. The auxiliary network, updated via EMA, effectively bridges the information gap between blocks. Notably, we observed that directly applying EMA parameters can be suboptimal due to feature discrepancies between local blocks. To resolve this issue, we introduce a learnable scaling bias that balances feature differences, thereby further improving performance. We validate MAN++ through extensive experiments on tasks that include image classification, object detection, and image segmentation, utilizing multiple network architectures. The experimental results demonstrate that MAN++ achieves performance comparable to end-to-end training while significantly reducing GPU memory usage. Consequently, MAN++ offers a novel perspective for supervised local learning and presents a viable alternative to conventional training methods.
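
The EMA-based update described in the abstract can be illustrated with a minimal sketch. It shows a standard exponential moving average that pulls an auxiliary network's parameters toward those of the adjacent block, followed by a learnable bias applied on top. The function names, the flat-list parameter representation, the momentum value, and the choice to add the bias elementwise are all assumptions made for illustration; the abstract does not specify how the scaling bias enters, so this should not be read as the authors' implementation.

```python
from typing import List


def ema_update(aux: List[float], target: List[float], momentum: float) -> List[float]:
    """Standard EMA step: aux <- momentum * aux + (1 - momentum) * target.

    `aux` stands in for the auxiliary network's parameters and `target` for
    the parameters copied from the adjacent local block (hypothetical names).
    """
    if len(aux) != len(target):
        raise ValueError("parameter lists must have the same length")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return [momentum * a + (1.0 - momentum) * t for a, t in zip(aux, target)]


def apply_learnable_bias(ema_params: List[float], bias: List[float]) -> List[float]:
    """Combine EMA parameters with a learnable bias.

    Assumption: the bias is added elementwise. In a real setup the bias
    would be trained by the local loss, while the EMA parameters receive
    no gradient.
    """
    if len(ema_params) != len(bias):
        raise ValueError("parameter and bias lists must have the same length")
    return [p + b for p, b in zip(ema_params, bias)]


if __name__ == "__main__":
    aux = [0.0, 0.0]                # auxiliary parameters, initialised at zero
    next_block_first = [1.0, -2.0]  # stand-in for the adjacent block's parameters
    for _ in range(3):
        aux = ema_update(aux, next_block_first, momentum=0.5)
    print(apply_learnable_bias(aux, [0.125, 0.25]))
```

A momentum close to 1 makes the auxiliary parameters track the adjacent block slowly, which is the usual reason for choosing an EMA over a direct copy; the value 0.5 above is only chosen to keep the arithmetic easy to follow.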

Paper Structure

This paper contains 43 sections, 25 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of accuracy across different datasets and backbones for both MAN++ and E2E methods. Fig. 1(a) shows the results of training from scratch on the ImageNet dataset for 90 epochs. Fig. 1(b) presents the results of training on the COCO dataset for 100 epochs using pretrained weights from ImageNet. Fig. 1(c) displays the results of training on the CityScapes dataset for 4,000 iterations, also using pretrained weights from ImageNet.
  • Figure 2: Comparison of (a) end-to-end backpropagation, (b) other supervised local learning methods, and (c) our proposed method. Unlike E2E, supervised local learning separates the network into K gradient-isolated local blocks. LB stands for the Learnable Bias.
  • Figure 3: Details of the Momentum Auxiliary Network. Local (i+1) represents the (i+1)-th gradient-isolated local block, which contains layers from layer m to layer (m+n), totaling n+1 layers ($n \geq 0$). We only use the parameters of the first layer to ensure a balance in GPU memory usage. Specifically, for ResNet, “first layer” refers to the first residual unit (i.e., a basic block or bottleneck), whereas for Vision Transformer, “first layer” refers to the first Transformer block.
  • Figure 4: Training-accuracy curves; both use the CIFAR-10 dataset.
  • Figure 5: Grad-CAM visualizations of our methods on the ImageNet-1K validation set. All visualizations were generated using checkpoints trained for 90 epochs on the ImageNet dataset using ResNet-101 ($K=4$).
  • ...and 1 more figures
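
The block structure mentioned in the captions of Figures 2 and 3 (K gradient-isolated local blocks, with only the first layer of the next block supplying parameters) can be sketched as follows. The helper names and the near-even split of layers are assumptions made for illustration; the captions do not state how layers are assigned to blocks.

```python
from typing import List, Optional


def partition_layers(num_layers: int, k: int) -> List[List[int]]:
    """Split layer indices 0..num_layers-1 into k contiguous blocks.

    Assumption: blocks are as even as possible, with earlier blocks
    taking one extra layer when num_layers is not divisible by k.
    """
    if not 1 <= k <= num_layers:
        raise ValueError("k must satisfy 1 <= k <= num_layers")
    base, extra = divmod(num_layers, k)
    blocks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def first_layer_of_next_block(blocks: List[List[int]], i: int) -> Optional[int]:
    """Index of the first layer of block i+1, or None if block i is the last."""
    return blocks[i + 1][0] if i + 1 < len(blocks) else None


if __name__ == "__main__":
    blocks = partition_layers(num_layers=10, k=4)
    for i, block in enumerate(blocks):
        print(i, block, first_layer_of_next_block(blocks, i))
```

Here each "layer" index stands for one unit in the sense of the Figure 3 caption (a residual unit for ResNet, a Transformer block for Vision Transformer).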