Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

Yibo Yang, Xiaojie Li, Motasem Alfarra, Hasan Hammoud, Adel Bibi, Philip Torr, Bernard Ghanem

TL;DR

The paper tackles the problem of training deep networks without full back-propagation by examining non-greedy local learning, which can fail to converge due to gradient misalignment between neighboring layers. It introduces successive gradient reconciliation (SGR), a gradient-distance regularizer added to local losses that reconciles adjacent layers while preserving gradient isolation and without extra learnable parameters, enabling both local-BP and BP-free training. Theoretical analysis under the PL condition shows how gradient discord affects convergence and how SGR mitigates this issue, while empirical results on CIFAR and ImageNet demonstrate substantial memory savings (over 40% for CNNs and Transformers) with competitive accuracy and robust ablations. This approach offers a principled, memory-efficient alternative to global BP with potential for scalable, biologically inspired learning and large-model finetuning.
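
For reference, the Polyak-Łojasiewicz (PL) condition mentioned above is usually stated in the following standard form. The symbols $f$, $f^{*}$, $\theta$, and $\mu$ are generic placeholders rather than the paper's notation, and the paper's own assumption may be phrased differently (for example, per layer-wise loss):

$$
\frac{1}{2}\,\bigl\|\nabla f(\theta)\bigr\|^{2} \;\ge\; \mu\,\bigl(f(\theta) - f^{*}\bigr) \quad \text{for all } \theta, \qquad \mu > 0,
$$

where $f^{*}$ is the minimum value of $f$. Under this condition every stationary point is a global minimizer, which is why it is a common assumption in convergence analyses of non-convex objectives.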

Abstract

Relieving neural network training of its reliance on global back-propagation (BP) has emerged as a notable research topic because BP is biologically implausible and consumes a large amount of memory. Among existing solutions, local learning optimizes gradient-isolated modules of a neural network with local errors and has proven effective even on large-scale datasets. However, the reconciliation among local errors has never been investigated. In this paper, we first theoretically study non-greedy layer-wise training and show that convergence cannot be assured when the local gradient in a module w.r.t. its input is not reconciled with the local gradient in the previous module w.r.t. its output. Inspired by this theoretical result, we further propose a local training strategy that successively regularizes the gradient reconciliation between neighboring modules without breaking gradient isolation or introducing any learnable parameters. Our method can be integrated into both local-BP and BP-free settings. In experiments, we achieve significant performance improvements over previous methods. In particular, for CNN and Transformer architectures on ImageNet, our method attains performance competitive with global BP while saving more than 40% of memory consumption.
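
The mismatch described in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes two linear modules, a squared-error local loss behind a fixed linear head for each module, and a squared Euclidean distance as the gradient-distance measure. All names (`reconciliation_penalty`, `head1`, `head2`, `W1`, `W2`) are hypothetical and chosen only for illustration.

```python
# Toy illustration (assumed setup, not the paper's code): measure how far the
# gradient of module 2's local loss w.r.t. its input is from the gradient of
# module 1's local loss w.r.t. its output. Both are gradients at the same feature h1.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def sq_norm(v):
    return sum(x * x for x in v)

def local_grad_wrt_feature(head, feature, target):
    """Gradient of 0.5 * ||head @ feature - target||^2 w.r.t. feature."""
    residual = sub(matvec(head, feature), target)
    return matvec(transpose(head), residual)

def reconciliation_penalty(x, y, W1, W2, head1, head2):
    h1 = matvec(W1, x)   # output of module 1, which is the input of module 2
    h2 = matvec(W2, h1)  # output of module 2
    # Module 1: gradient of its local loss w.r.t. its output h1.
    g_out = local_grad_wrt_feature(head1, h1, y)
    # Module 2: gradient of its local loss w.r.t. its input h1 (chain rule through W2).
    g_in = matvec(transpose(W2), local_grad_wrt_feature(head2, h2, y))
    # Squared distance between the two gradients (assumed distance measure).
    return sq_norm(sub(g_in, g_out)), g_out, g_in

identity = [[1.0, 0.0], [0.0, 1.0]]
x, y = [1.0, 2.0], [0.0, 0.0]

# Identical local objectives on both sides of h1: the gradients agree.
aligned, _, _ = reconciliation_penalty(x, y, identity, identity, identity, identity)
# Module 2 rescales its input, so its input gradient no longer matches g_out.
discordant, g_out, g_in = reconciliation_penalty(
    x, y, identity, [[2.0, 0.0], [0.0, 2.0]], identity, identity
)
print(aligned, discordant)  # prints "0.0 45.0"
```

A penalty of this kind would be added to each local loss so that neighboring modules are pushed toward agreeing gradients. Computing it needs only quantities local to the two adjacent modules, which is consistent with the abstract's claim that gradient isolation is preserved.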

Paper Structure

This paper contains 19 sections, 9 theorems, 54 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Theorem 3.3

Based on Assumptions assump1 and assump2, if the learning rates are set as $\eta_1^{(i)}=\eta_1$ and $\eta_2^{(i)}=\eta_2$, where [...], we have the following convergence [...], and recursively applying Eq. (convergence1) we have [...], where ${\mathcal{L}}_2^{(i,i)}$ denotes ${\mathcal{L}}_2({\bm{\theta}}_1^{(i)}, {\bm{\theta}}_2^{(i)})$, i.e., the second-layer loss value with parameters ${\bm{\theta}}_1^{(i)}$ an

Figures (4)

  • Figure 1: An illustration comparing our method with non-greedy local learning and global BP. The blue arrows indicate forward propagation, while the red solid arrows and red dashed arrows denote the backward gradient w.r.t. the output feature and the input feature of each block, respectively. In global BP, gradients are passed into prior blocks to update the parameters, but in local learning the updates may deviate because of local errors. Our method successively reconciles local updates in a forward mode without breaking gradient isolation.
  • Figure 2: The training curves of average ${\mathcal{L}}^{(SGR)}$ of all neighboring layers and accuracy on test set.
  • Figure 3: We measure the change of classification loss in each layer caused by the input update from prior layers using a 4-layer PlainNet. The black dashed line denotes the zero baseline: the area below it means that the updates of prior layers produce a new output feature that, when used as the input of the current layer, reduces that layer's loss value.
  • Figure 4: Train loss (left), train accuracy (middle), and test loss (right) curves with and without our method. The corresponding SGR loss value and test accuracy curves are shown in Figure 2.

Theorems & Definitions (15)

  • Theorem 3.3
  • Proposition 3.4
  • Proposition 3.5
  • Theorem 1.3
  • Lemma 1.4
  • Lemma 1.5
  • Proof of Lemma (lemma1)
  • Proof of Lemma (lemma2)
  • Proof of Theorem (app:theorem)
  • Proposition 2.1
  • ...and 5 more