Learning without Global Backpropagation via Synergistic Information Distillation

Chenhao Ye; Ming Tang

TL;DR

Backpropagation struggles with update locking and high memory demands in deep networks. SID replaces global gradient propagation with a cascade of locally optimized belief refinements, decoupling modules via stop-gradient and a two-phase training scheme that enables parallel updates and memory efficiency. Theoretical results show monotonic improvement and exponential convergence of beliefs to the target under ideal conditions, with practical robustness to imperfect optimization and label noise. Empirically, SID matches or surpasses BP, scales favorably with depth, and demonstrates broad generality across architectures, offering a practical alternative for training large-scale models.

Abstract

Backpropagation (BP), while foundational to deep learning, imposes two critical scalability bottlenecks: update locking, where network modules remain idle until the entire backward pass completes, and high memory consumption due to storing activations for gradient computation. To address these limitations, we introduce Synergistic Information Distillation (SID), a novel training framework that reframes deep learning as a cascade of local cooperative refinement problems. In SID, a deep network is structured as a pipeline of modules, each assigned a local objective to refine a probabilistic belief about the ground-truth target. This objective balances fidelity to the target with consistency with the belief from the preceding module. By decoupling the backward dependencies between modules, SID enables parallel training, which eliminates update locking and drastically reduces memory requirements. At the same time, this design preserves the standard feed-forward inference pass, making SID a versatile drop-in replacement for BP. We provide a theoretical foundation, proving that SID guarantees monotonic performance improvement with network depth. Empirically, SID consistently matches or surpasses the classification accuracy of BP, exhibiting superior scalability and pronounced robustness to label noise. Code is available at: https://github.com/ychAlbert/sid-bp
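To make the local objective concrete, here is a minimal sketch of what one module's loss could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weight `alpha`, the use of cross-entropy for the fidelity term, and the use of a KL divergence for the consistency term are all hypothetical choices that merely match the abstract's description of balancing fidelity to the target against consistency with the preceding belief.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability vector (the module's belief)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def local_loss(logits, prev_belief, target_index, alpha=0.5):
    """Hypothetical per-module objective (assumed form, not from the paper).

    fidelity:    cross-entropy between the module's belief and the target.
    consistency: KL divergence from the previous module's belief.
    prev_belief is a plain list of floats, so it acts as a constant here;
    in an autograd framework it would be detached (stop-gradient) so that
    no gradient flows back into the preceding module.
    """
    belief = softmax(logits)
    fidelity = -math.log(belief[target_index] + 1e-12)
    consistency = kl_divergence(belief, prev_belief)
    return alpha * fidelity + (1.0 - alpha) * consistency

# Usage: with a uniform incoming belief, logits that favor the true class
# yield a lower local loss than logits that favor a wrong class.
uniform = [0.25, 0.25, 0.25, 0.25]
loss_good = local_loss([2.0, 0.0, 0.0, 0.0], uniform, target_index=0)
loss_bad = local_loss([0.0, 2.0, 0.0, 0.0], uniform, target_index=0)
```

Because each module's loss depends only on its own output, the target, and a constant copy of the incoming belief, the modules can in principle be updated independently of one another, which is the decoupling the abstract refers to.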
