Go beyond End-to-End Training: Boosting Greedy Local Learning with Context Supply

Chengting Yu; Fengzhao Zhang; Hanzhi Ma; Aili Wang; Erping Li

TL;DR

This paper analyzes the memory-efficient Greedy Local Learning (GLL) paradigm through an information-theoretic lens and identifies an irreversible loss of task-relevant information, $I(h_l,y)$, as the core bottleneck that degrades performance when the network is split into many gradient-isolated modules. It introduces Context Supply (ContSup), a simple mechanism that injects context $c_l$ into intermediate features via $h_{l-1}^c = h_{l-1} + \mathcal{M}^l(c_l)$, with priors [E] and [R1] that supply information from the original input or from earlier modules, and optionally richer topologies (RnE). Theoretical results explain how ContSup can compensate for the lost information and mitigate the confirmed habits dilemma. Experiments on CIFAR-10, SVHN, and STL-10 demonstrate state-of-the-art performance among greedy local learning methods with significantly reduced memory overhead, and accuracy is maintained as the number of modules grows. The work offers practical guidance for scalable, memory-efficient local learning and a bridge toward more biologically plausible modular training schemes. The analysis is expressed in information-theoretic terms, including $I(h_l,y)$, $I(x,y)$, and related bounds, which quantify how information flows and is restored across modules.
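The context injection described above, where a mapped context is added element-wise to the previous module's feature so that the feature shape is unchanged, can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the names `linear`, `supply_context`, and `m_weights` are hypothetical, the context mapping is assumed to be a single linear map, and the toy vectors are invented for the example.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, v: Vector) -> Vector:
    """Apply a linear map: one output entry per row of `weights`."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def supply_context(h_prev: Vector, context: Vector, m_weights: Matrix) -> Vector:
    """Return h_prev + M(context), added element-wise.

    `m_weights` stands in for the context mapping (assumed linear here).
    It must project the context to the size of `h_prev`, so the sum keeps
    the feature shape that the next module expects.
    """
    mapped = linear(m_weights, context)
    if len(mapped) != len(h_prev):
        raise ValueError("context mapping must match the feature size")
    return [h + m for h, m in zip(h_prev, mapped)]


# Toy values (hypothetical): a 2-d feature and a 3-d context, for example
# the raw input when context is taken from the original input.
h_prev = [1.0, -2.0]
x = [0.5, 1.0, 2.0]
M = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]

h_c = supply_context(h_prev, x, M)
print(h_c)  # [1.5, 1.0]
```

Because the context enters by addition rather than concatenation, the module that consumes `h_c` needs no change to its input size, which is consistent with the shape-preserving design described in the Figure 1 caption.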

Abstract

Traditional end-to-end (E2E) training of deep networks requires storing intermediate activations for back-propagation, resulting in a large memory footprint on GPUs and restricted model parallelization. As an alternative, greedy local learning partitions the network into gradient-isolated modules and trains each module under supervision from local preliminary losses, enabling asynchronous and parallel training that substantially reduces memory cost. However, experiments reveal that as the number of gradient-isolated modules increases, the performance of the local learning scheme degrades substantially, severely limiting its scalability. To address this issue, we theoretically analyze greedy local learning from the standpoint of information theory and propose the ContSup scheme, which incorporates context supply between isolated modules to compensate for information loss. Experiments on benchmark datasets (i.e., CIFAR, SVHN, STL-10) achieve SOTA results and indicate that the proposed method can significantly improve the performance of greedy local learning with minimal memory and computational overhead, allowing the number of isolated modules to be increased. Our code is available at https://github.com/Tab-ct/ContSup.
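To make "gradient-isolated modules trained with local losses" concrete, the following toy sketch trains two scalar modules greedily. It is an illustration under stated assumptions, not the paper's training code: the function `train_greedy`, the scalar weights `w1`, `a1`, `w2`, the squared-error losses, and the data are all hypothetical, and gradients are written out by hand instead of using an autodiff library.

```python
def train_greedy(data, w2_init=0.5, lr=0.05, epochs=200):
    """Greedy local training of two scalar modules with hand-written gradients.

    Module 1: h = w1 * x, with a local head a1 * h and loss (a1 * h - y) ** 2.
    Module 2: output w2 * h with loss (w2 * h - y) ** 2, where h is treated
    as a constant input, so no gradient from module 2 ever reaches w1.
    """
    w1, a1, w2 = 0.5, 0.5, w2_init
    for _ in range(epochs):
        for x, y in data:
            # Module 1: forward pass and update from its own local loss only.
            h = w1 * x
            err1 = a1 * h - y
            grad_w1 = 2 * err1 * a1 * x
            grad_a1 = 2 * err1 * h
            w1 -= lr * grad_w1
            a1 -= lr * grad_a1

            # Module 2: consumes h as a detached value and updates only w2.
            err2 = w2 * h - y
            w2 -= lr * 2 * err2 * h
    return w1, a1, w2


# Toy regression target y = 2 * x (invented data).
data = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0)]
w1, a1, w2 = train_greedy(data)
print(w1 * a1, w1 * w2)
```

Since each module's update depends only on its own input and local loss, a module's activations do not have to be kept for a later global backward pass. That is the source of the memory savings the abstract refers to. In this sketch the isolation is visible in the fact that the initial value of `w2` has no effect on the trained `w1` and `a1`.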

Paper Structure (29 sections, 40 equations, 11 figures, 12 tables, 2 algorithms)

Introduction
Theoretical Analysis of Greedy Local Learning
Preliminary
Naïve Greedy Local Learning suffers a dilemma called confirmed habits
The local reconstruction eases short-sight but still in confirmed habits.
Context Supply gives chance to escape the dilemma
Experiments
Setup
Comparison with State-of-the-Arts
Ablation Study
Extension toward Topological Connection
Weight Visualization
Conclusion
More about mutual information in GLL
Proofs and extra discussion on theoretical results
...and 14 more sections

Figures (11)

Figure 1: Dataflow in training schemes. (a) The standard paradigm, which back-propagates errors end-to-end in reverse. (b) Greedy learning with locally defined objectives. (c) ContSup provides a context path in addition to the feature path; the two are combined by element-wise addition, which preserves the shapes of the features and the computations within modules.
Figure 2: An information-theoretic perspective on GLL. (a) illustrates information trends across network depth, showing the monotonic decrease of $I(h_l,y)$ and the general upward trend of $I(\hat{y}_l,y)$, where the final performance is obtained by progressive improvement. (b) shows that local reconstruction efforts to alleviate the short-sight issue are impeded by the confirmed habit. (c) illustrates the function of context, which enables a local module to surpass its obstruction.
Figure 3: Comparisons of ContSup and state-of-the-art GLL methods in terms of the test errors. The best results of methods based on $K$-partitioned ResNet-32 and CIFAR-10 are reported.
Figure 4: Comparison of the GLL methods' test errors on the CIFAR-10 as a function of GPU memory footprint. Results of training both ResNet32 and ResNet110 on a single Tesla V100-PCIE-32GB GPU are reported.
Figure 5: Ablation studies. Test errors of ResNet-32 on CIFAR-10 are reported.
...and 6 more figures
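
The monotonic decrease of $I(h_l,y)$ described in the Figure 2 caption is what the data processing inequality would predict. As a sketch, assume each module is a deterministic map $h_l = f_l(h_{l-1})$ with $h_0 = x$ and $L$ modules in total. The symbols $f_l$ and $L$ are introduced here for illustration and are not taken from the paper. The features then form a Markov chain, and no module can recover label information that its input has already lost:

$$
y \to x \to h_1 \to \cdots \to h_L
\quad\Longrightarrow\quad
I(x,y) \ge I(h_1,y) \ge \cdots \ge I(h_L,y).
$$

Under this assumption, additional information about $y$ can reach module $l$ only through a path that bypasses $h_{l-1}$, which is the role the context $c_l$ plays in ContSup.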
