ADMM Algorithms for Residual Network Training: Convergence Analysis and Parallel Implementation

Jintao Xu, Yifei Li, Wenxun Xing

TL;DR

The paper tackles scalable training of residual networks by moving beyond backpropagation and exploiting proximal (linearized) ADMM (RADMM) techniques. It develops two-splitting and three-splitting relaxations that enable parallel regional updates (PRU) and distributed training, with formal convergence guarantees (to KKT points) and rates governed by the KL exponent, independent of network width, depth, or data size. The authors prove both iteration and objective-value convergence, derive time-complexity and memory benefits for parallel implementations, and provide a practical Python-based control protocol for parallel RADMM execution. Empirical results on the Wine Quality dataset show fast, stable convergence, superior performance for deep models, and meaningful speedups from parallelization, validating the approach's practical impact for large-scale residual-network training.

Abstract

We propose both serial and parallel proximal (linearized) alternating direction method of multipliers (ADMM) algorithms for training residual neural networks. In contrast to backpropagation-based approaches, our methods inherently mitigate the exploding gradient issue and are well-suited for parallel and distributed training through regional updates. Theoretically, we prove that the proposed algorithms converge at an R-linear (sublinear) rate for both the iteration points and the objective function values. These results hold without imposing stringent constraints on network width, depth, or training data size. Furthermore, we theoretically analyze our parallel/distributed ADMM algorithms, highlighting their reduced time complexity and lower per-node memory consumption. To facilitate practical deployment, we develop a control protocol for parallel ADMM implementation using Python's multiprocessing and interprocess communication. Experimental results validate the proposed ADMM algorithms, demonstrating rapid and stable convergence, improved performance, and high computational efficiency. Finally, we highlight the improved scalability and efficiency achieved by our parallel ADMM training strategy.

Paper Structure

This paper contains 42 sections, 11 theorems, 46 equations, 7 figures, 4 tables, and 4 algorithms.

Key Result

Proposition 1

$\mathcal{L}_{\beta}^{2s}(\Psi_{2s}^{k+1})\le\mathcal{L}_{\beta}^{2s}(\Psi_{2s}^{k})-c_1\Vert \Psi_{2s}^{k+1}-\Psi_{2s}^{k}\Vert^2$ with some $c_1>0$.
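To make the shape of this sufficient-descent inequality concrete, the sketch below checks it numerically on a toy two-block ADMM problem. This is not the paper's residual-network formulation or its RADMM updates: the problem (minimize $\frac12\Vert x-a\Vert^2+\frac12\Vert z-b\Vert^2$ subject to $x=z$), the function names `aug_lagrangian` and `run_toy_admm`, the penalty $\beta=2$, and the constant $c_1=0.5$ are all assumptions chosen for illustration. Here $\Psi=(x,z,\lambda)$, and for this toy problem the exact block updates give a decrease of $\frac{1+\beta}{2}\Vert\Delta x\Vert^2+\left(\frac{1+\beta}{2}-\frac{1}{\beta}\right)\Vert\Delta z\Vert^2$ per iteration, which dominates $0.5\,\Vert\Delta\Psi\Vert^2$ when $\beta=2$.

```python
import numpy as np


def aug_lagrangian(x, z, lam, a, b, beta):
    """Augmented Lagrangian of the toy problem (hypothetical, not the paper's):
    f(x) = 0.5||x - a||^2, g(z) = 0.5||z - b||^2, constraint x - z = 0."""
    r = x - z
    return (0.5 * float(np.sum((x - a) ** 2))
            + 0.5 * float(np.sum((z - b) ** 2))
            + float(lam @ r)
            + 0.5 * beta * float(r @ r))


def run_toy_admm(a, b, beta=2.0, iters=60):
    """Run exact two-block ADMM and record L_beta and ||Psi^{k+1} - Psi^k||^2."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(a)
    z = np.zeros_like(a)
    # Initialise the multiplier so that lam = grad g(z) = z - b already holds;
    # the z-update preserves this relation at every later iterate.
    lam = z - b
    values = [aug_lagrangian(x, z, lam, a, b, beta)]
    steps_sq = []
    for _ in range(iters):
        # x-update: closed-form minimiser of L_beta over x.
        x_new = (a - lam + beta * z) / (1.0 + beta)
        # z-update: closed-form minimiser of L_beta over z.
        z_new = (b + lam + beta * x_new) / (1.0 + beta)
        # Multiplier (dual ascent) update.
        lam_new = lam + beta * (x_new - z_new)
        steps_sq.append(float(np.sum((x_new - x) ** 2)
                              + np.sum((z_new - z) ** 2)
                              + np.sum((lam_new - lam) ** 2)))
        x, z, lam = x_new, z_new, lam_new
        values.append(aug_lagrangian(x, z, lam, a, b, beta))
    return x, z, values, steps_sq


if __name__ == "__main__":
    a = np.array([1.0, -2.0, 3.0])
    b = np.array([0.5, 4.0, -1.0])
    x, z, values, steps_sq = run_toy_admm(a, b)
    c1 = 0.5  # valid for this toy problem with beta = 2 only
    ok = all(values[k + 1] <= values[k] - c1 * steps_sq[k] + 1e-9
             for k in range(len(steps_sq)))
    print("sufficient descent holds at every iteration:", ok)
```

The recorded `values` sequence is non-increasing, and each drop is at least `c1` times the squared step, mirroring the role Proposition 1 plays for $\mathcal{L}_{\beta}^{2s}$. The paper's actual constant $c_1$ and the conditions under which it exists are given in its appendix, not here.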

Figures (7)

  • Figure 1: Relationships between training problem, relaxations, and RADMMs.
  • Figure 2: Relationship between $\mathcal{L}_{\beta}^{3s}(\Psi_{3s})$ and $\mathcal{L}_R^{3s}(\widetilde{\Psi}_{3s})$.
  • Figure 3: PRU employed in parallel RADMMs.
  • Figure 4: Pipelined update pattern of the parallel two-splitting RADMMs.
  • Figure 5: Training (left) and test (right) MSE losses for the 40-layer ReLU (top) and sigmoid (bottom) residual networks on the Wine Quality dataset.
  • ...and 2 more figures

Theorems & Definitions (17)

  • Definition 1: Bertsekas2015
  • Definition 2: Mordukhovich2006, Rockafellar1998
  • Definition 3: Mordukhovich2006, Rockafellar1998
  • Definition 4: Bertsekas2015
  • Definition 5: Attouch2010, Li2018
  • Definition 6: Krantz2002
  • Proposition 1: Proof in Appendix \ref{app:1}
  • Proposition 2: Proof in Appendix \ref{app:2}
  • Theorem 1: Convergence, proof in Appendix \ref{proof of theorems 1 and 2}
  • Theorem 2: Convergence Rate, proof in Appendix \ref{proof of theorems 1 and 2}
  • ...and 7 more