Learning without Global Backpropagation via Synergistic Information Distillation

Chenhao Ye, Ming Tang

TL;DR

Backpropagation struggles with update locking and high memory demands in deep networks. SID replaces global gradient propagation with a cascade of locally optimized belief refinements, decoupling modules via stop-gradient and a two-phase training scheme that enables parallel updates and memory efficiency. Theoretical results show monotonic improvement and exponential convergence of beliefs to the target under ideal conditions, with practical robustness to imperfect optimization and label noise. Empirically, SID matches or surpasses BP, scales favorably with depth, and demonstrates broad generality across architectures, offering a practical alternative for training large-scale models.

Abstract

Backpropagation (BP), while foundational to deep learning, imposes two critical scalability bottlenecks: update locking, where network modules remain idle until the entire backward pass completes, and high memory consumption due to storing activations for gradient computation. To address these limitations, we introduce Synergistic Information Distillation (SID), a novel training framework that reframes deep learning as a cascade of local cooperative refinement problems. In SID, a deep network is structured as a pipeline of modules, each assigned a local objective to refine a probabilistic belief about the ground-truth target. This objective balances fidelity to the target with consistency with the belief from the preceding module. By decoupling the backward dependencies between modules, SID enables parallel training, which eliminates update locking and drastically reduces memory requirements. This design also preserves the standard feed-forward inference pass, making SID a versatile drop-in replacement for BP. We provide a theoretical foundation, proving that SID guarantees monotonic performance improvement with network depth. Empirically, SID consistently matches or surpasses the classification accuracy of BP, exhibiting superior scalability and pronounced robustness to label noise. Code is available at: https://github.com/ychAlbert/sid-bp
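The local objective described in the abstract can be illustrated with a small numerical sketch. The abstract does not give the formula, so everything below is an assumption made for illustration: the loss is taken to be a weighted sum of two KL terms, `alpha * KL(p || p_y) + (1 - alpha) * KL(p || p_prev)`, with a hypothetical weight `alpha`, and the target `p_y` is assumed to be smoothed so that it has full support. Under that assumed form, the minimizer is the normalized geometric mean of `p_prev` and `p_y`, which the code checks against randomly sampled beliefs.

```python
import math
import random

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def local_loss(p, p_prev, p_y, alpha):
    """Assumed local objective: fidelity to the target plus consistency
    with the previous belief. p_prev is a plain constant here, which mimics
    a stop-gradient: nothing flows back into the preceding module."""
    return alpha * kl(p, p_y) + (1.0 - alpha) * kl(p, p_prev)

def geometric_mean(p_prev, p_y, alpha):
    """Minimizer of the assumed loss: p proportional to p_prev^(1-alpha) * p_y^alpha."""
    return normalize([a ** (1.0 - alpha) * b ** alpha for a, b in zip(p_prev, p_y)])

# Hypothetical three-class example.
rng = random.Random(0)
p_y = [0.9, 0.05, 0.05]      # smoothed target, assumed to have full support
p_prev = [0.2, 0.5, 0.3]     # belief handed over by the preceding module
alpha = 0.5                  # hypothetical fidelity/consistency weight

best = geometric_mean(p_prev, p_y, alpha)
best_loss = local_loss(best, p_prev, p_y, alpha)

# Compare against 1000 random beliefs on the simplex.
random_losses = [
    local_loss(normalize([rng.random() + 1e-3 for _ in p_y]), p_prev, p_y, alpha)
    for _ in range(1000)
]
```

No random belief should score below `best_loss` in this setup, since the assumed loss is convex in `p` and the geometric mean is its global minimizer.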

Paper Structure

This paper contains 57 sections, 6 theorems, 30 equations, 9 figures, 5 tables, 2 algorithms.

Key Result

Proposition 1

Suppose the optimal-updates assumption (assum:optimal_updates) holds. Given an initial uniform belief $p_0$, the belief $p_i$ after module $i$ is a geometric interpolation between $p_0$ and $p_y$: The symbol $\propto$ denotes proportionality up to a normalization constant. Consequently, as the number of modules $i \to \infty$, the belief $p_i$ converges pointwise to the target distribution $p_y$. The rate of convergen
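The geometric interpolation in Proposition 1 can be checked numerically. The displayed equation is not reproduced above, so the sketch below rests on stated assumptions rather than on the paper's exact formula: each optimal local update is taken to be $p_i \propto p_{i-1}^{1-\alpha} p_y^{\alpha}$ for a hypothetical weight `alpha`, which unrolls to $p_i \propto p_0^{(1-\alpha)^i} p_y^{1-(1-\alpha)^i}$, and $p_y$ is assumed to have full support. The code iterates the update, compares it with the unrolled form, and tracks the KL divergence to the target.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def local_update(p_prev, p_y, alpha):
    """Assumed optimal local update: normalized geometric mean."""
    return normalize([a ** (1.0 - alpha) * b ** alpha for a, b in zip(p_prev, p_y)])

def unrolled(p0, p_y, alpha, i):
    """Unrolled form of i assumed updates: the weight on p0 decays as (1 - alpha)^i."""
    w = (1.0 - alpha) ** i
    return normalize([a ** w * b ** (1.0 - w) for a, b in zip(p0, p_y)])

def run_cascade(p0, p_y, alpha, depth):
    """Apply the local update `depth` times, keeping every intermediate belief."""
    beliefs = [p0]
    for _ in range(depth):
        beliefs.append(local_update(beliefs[-1], p_y, alpha))
    return beliefs

# Hypothetical three-class example with a uniform initial belief.
p0 = [1.0 / 3.0] * 3
p_y = [0.9, 0.05, 0.05]   # smoothed target, assumed to have full support
alpha = 0.5               # hypothetical weight
beliefs = run_cascade(p0, p_y, alpha, depth=8)
divergences = [kl(p_y, p) for p in beliefs]

for i, d in enumerate(divergences):
    print(f"module {i}: KL(p_y || p_i) = {d:.6f}")
```

In this sketch the divergence shrinks at every module and the weight on $p_0$ decays geometrically, which is the behaviour the proposition describes under ideal updates.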

Figures (9)

  • Figure 1: Comparison of BP and SID Training Paradigms. While BP's end-to-end gradient propagation (top) creates a sequential bottleneck, SID (bottom) introduces a two-phase process. Phase 1 performs a single gradient-free forward pass to generate a set of fixed "teacher" beliefs. Phase 2 then uses these local teachers to update all modules in parallel, fundamentally resolving update locking and enabling scalable training.
  • Figure 2: Test accuracy convergence curves on CIFAR-10 (left), CIFAR-100 (center), and Tiny-ImageNet (right). SID (blue, bold) demonstrates superior performance on more complex datasets. This performance gain is achieved alongside SID's significant reductions in time and memory complexity (see Section 4.3).
  • Figure 3: Theoretical speedup of SID over BP.
  • Figure 4: Depth scaling on CIFAR-100. Left: On a SimpleCNN, SID's accuracy improves with depth while BP's degrades. Right: On a ResNet, SID shows more consistent improvement.
  • Figure 5: Left: CKA similarity between SID and BP beliefs. The vertical stripe in the final column highlights SID's "converge-then-refine" dynamic. Right: Belief evolution on hard examples (misclassified by at least one model). SID's belief in the true class (blue) shows a more stable and monotonic progression than BP's (orange).
  • ...and 4 more figures
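The two-phase process in the Figure 1 caption can be sketched on a toy problem. This is an illustrative sketch under stated assumptions, not the authors' implementation: each "module" is a hypothetical bias vector added to the log of the incoming belief, the local loss is assumed to be `ALPHA * KL(p || p_y) + (1 - ALPHA) * KL(p || teacher)`, and `ALPHA`, `LR`, and all function names are invented for the example. Phase 1 runs a forward pass with no gradients to fix the teacher beliefs; Phase 2 updates every module from its own teacher only, so the updates are independent of one another.

```python
import math
from concurrent.futures import ThreadPoolExecutor

ALPHA = 0.5  # hypothetical fidelity/consistency weight
LR = 1.0     # hypothetical learning rate

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def module_forward(bias, p_prev):
    """Toy module: shift the log of the incoming belief by a learned bias."""
    return softmax([math.log(p) + b for p, b in zip(p_prev, bias)])

def phase1_teachers(biases, p0):
    """Phase 1: gradient-free forward pass; returns each module's input belief."""
    teachers = [p0]
    for bias in biases[:-1]:
        teachers.append(module_forward(bias, teachers[-1]))
    return teachers

def phase2_local_step(bias, teacher, p_y):
    """Phase 2: one gradient step on the assumed local loss, with the teacher fixed."""
    p = module_forward(bias, teacher)
    terms = [ALPHA * math.log(pj / yj) + (1.0 - ALPHA) * math.log(pj / tj)
             for pj, yj, tj in zip(p, p_y, teacher)]
    loss = sum(pj * t for pj, t in zip(p, terms))
    # Gradient of the loss with respect to the logits (and hence the bias).
    grad = [pj * (t - loss) for pj, t in zip(p, terms)]
    return [b - LR * g for b, g in zip(bias, grad)]

def train(depth, p0, p_y, steps):
    biases = [[0.0] * len(p0) for _ in range(depth)]
    with ThreadPoolExecutor() as pool:
        for _ in range(steps):
            teachers = phase1_teachers(biases, p0)
            # Each local step reads only its own bias and teacher, so the
            # steps can be dispatched concurrently. Python threads only
            # illustrate the independence; they do not give a real speedup.
            biases = list(pool.map(phase2_local_step, biases, teachers, [p_y] * depth))
    return biases

def beliefs_after_each_module(biases, p0):
    out, p = [], p0
    for bias in biases:
        p = module_forward(bias, p)
        out.append(p)
    return out

p0 = [1.0 / 3.0] * 3
p_y = [0.9, 0.05, 0.05]  # smoothed target for true class 0 (assumed)
trained = train(depth=4, p0=p0, p_y=p_y, steps=300)
beliefs = beliefs_after_each_module(trained, p0)
true_class_probs = [p[0] for p in beliefs]
```

Because `phase2_local_step` returns a new bias instead of mutating shared state, no module waits on another module's backward pass, which is the property the caption refers to as resolving update locking.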

Theorems & Definitions (13)

  • Proposition 1: Closed-Form Cascade and Exponential Convergence
  • proof
  • Proposition 2: Monotonic Descent Guarantee
  • proof
  • Proposition 3: Computational and Memory Complexity
  • proof
  • Lemma 1: Closed-Form Local Minimizer
  • proof: Proof of Lemma 1
  • proof: Proof of Proposition 1
  • Proposition 4: Monotonic Descent with Bounded Error
  • ...and 3 more