Go beyond End-to-End Training: Boosting Greedy Local Learning with Context Supply
Chengting Yu, Fengzhao Zhang, Hanzhi Ma, Aili Wang, Erping Li
TL;DR
This paper analyzes the memory-efficient Greedy Local Learning (GLL) paradigm through an information-theoretic lens and identifies an irreversible loss of task-relevant information, $I(h_l,y)$, as the core bottleneck that degrades performance when the network is split into many gradient-isolated modules. It introduces Context Supply (ContSup), a simple mechanism that injects context $c_l$ into intermediate features via $h_{l-1}^c = h_{l-1} + \mathcal{M}^l(c_l)$, with priors [E] and [R1] that supply information from the original input or from earlier modules, and optionally richer topologies (RnE). The theoretical results explain how ContSup can compensate for lost information and mitigate the confirmed habits dilemma. Experiments on CIFAR-10, SVHN, and STL-10 demonstrate state-of-the-art performance for greedy local learning with significantly reduced memory overhead, and accuracy is maintained as the number of modules grows. The work provides practical guidance for scalable, memory-efficient local learning and offers a bridge toward more biologically plausible modular training schemes. Overall, by reintroducing context, ContSup allows deeper modular decompositions of GLL while retaining strong performance and efficiency. The analysis quantifies information flow and restoration across modules in information-theoretic terms, using $I(h_l,y)$, $I(x,y)$, and related bounds.
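The context-supply update $h_{l-1}^c = h_{l-1} + \mathcal{M}^l(c_l)$ can be illustrated with a minimal forward-pass sketch in plain Python. This is not the authors' implementation. The function names (`linear`, `relu`, `forward_with_context`), the equal-width dense layers, and the ReLU nonlinearity are placeholders chosen for illustration. The reading of [R1] as $c_l = h_{l-2}$, falling back to the input $x$ for the first module, is an assumption. Local losses and gradient isolation are omitted.

```python
def linear(weights, vec):
    """Apply a dense weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vec]


def forward_with_context(x, module_weights, context_weights, prior="E"):
    """Forward pass through L modules with an additive context supply.

    module_weights[l-1]  : weights of module l (hypothetical dense layer)
    context_weights[l-1] : weights of the context map M^l (hypothetical)
    prior="E"  : c_l = x, the original input
    prior="R1" : c_l = h_{l-2} (assumed reading), or x when l < 2
    Returns [h_0, h_1, ..., h_L] with h_0 = x.
    """
    feats = [x]
    pairs = zip(module_weights, context_weights)
    for l, (w, m) in enumerate(pairs, start=1):
        h_prev = feats[l - 1]
        if prior == "E":
            context = x
        elif prior == "R1":
            context = feats[max(l - 2, 0)]
        else:
            raise ValueError(f"unknown prior: {prior!r}")
        # h_{l-1}^c = h_{l-1} + M^l(c_l)
        supplied = linear(m, context)
        h_prev_c = [a + b for a, b in zip(h_prev, supplied)]
        # Module l consumes the context-augmented feature.
        feats.append(relu(linear(w, h_prev_c)))
    return feats
```

Setting every context map to zero recovers the plain stacked forward pass, which makes the additive term easy to isolate when comparing the two priors.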
Abstract
Traditional end-to-end (E2E) training of deep networks requires storing intermediate activations for back-propagation, resulting in a large GPU memory footprint and limited model parallelization. As an alternative, greedy local learning partitions the network into gradient-isolated modules and trains each module under supervision with its own local loss, enabling asynchronous, parallel training that substantially reduces memory cost. However, empirical experiments reveal that as the number of gradient-isolated modules increases, the performance of the local learning scheme degrades substantially, severely limiting its scalability. To address this issue, we theoretically analyze greedy local learning from the standpoint of information theory and propose the ContSup scheme, which incorporates context supply between isolated modules to compensate for information loss. Experiments on benchmark datasets (CIFAR, SVHN, and STL-10) achieve SOTA results and indicate that the proposed method significantly improves the performance of greedy local learning with minimal memory and computational overhead, allowing the number of isolated modules to be increased. Our code is available at https://github.com/Tab-ct/ContSup.
