ADMM Algorithms for Residual Network Training: Convergence Analysis and Parallel Implementation

Jintao Xu; Yifei Li; Wenxun Xing

TL;DR

The paper tackles scalable training of residual networks by moving beyond backpropagation and using proximal (linearized) ADMM (RADMM) techniques. It develops two-splitting and three-splitting relaxations that enable parallel regional updates (PRU) and distributed training, with formal convergence guarantees (to KKT points) and rates governed by the Kurdyka-Łojasiewicz (KL) exponent. These results hold without stringent constraints on network width, depth, or training data size. The authors prove convergence of both the iterates and the objective values, derive time-complexity and per-node memory benefits for parallel implementations, and provide a control protocol for parallel RADMM execution built on Python's multiprocessing and interprocess communication. Empirical results on the Wine Quality dataset show fast, stable convergence, improved performance for deep models, and speedups from parallelization.

Abstract

We propose both serial and parallel proximal (linearized) alternating direction method of multipliers (ADMM) algorithms for training residual neural networks. In contrast to backpropagation-based approaches, our methods inherently mitigate the exploding gradient issue and are well-suited for parallel and distributed training through regional updates. Theoretically, we prove that the proposed algorithms converge at an R-linear (sublinear) rate for both the iteration points and the objective function values. These results hold without imposing stringent constraints on network width, depth, or training data size. Furthermore, we theoretically analyze our parallel/distributed ADMM algorithms, highlighting their reduced time complexity and lower per-node memory consumption. To facilitate practical deployment, we develop a control protocol for parallel ADMM implementation using Python's multiprocessing and interprocess communication. Experimental results validate the proposed ADMM algorithms, demonstrating rapid and stable convergence, improved performance, and high computational efficiency. Finally, we highlight the improved scalability and efficiency achieved by our parallel ADMM training strategy.

Paper Structure (42 sections, 11 theorems, 46 equations, 7 figures, 4 tables, 4 algorithms)

Introduction
Related Work
Alternatives to BP-Based Training Algorithms
Alternating Direction Method of Multipliers
DNNs Parallel Training
Preliminaries
Notations
Optimization and Variational Analysis
Two-splitting RADMMs
Two-Splitting Proximal (Linearized) RADMMs
Convergence (Rate) of Two-Splitting RADMMs
Three-splitting RADMMs
Three-Splitting Proximal (Linearized) RADMMs
Convergence (Rate) of Three-Splitting RADMMs
Regularized Augmented Lagrangian
...and 27 more sections

Key Result

Proposition 1

$\mathcal{L}_{\beta}^{2s}(\Psi_{2s}^{k+1})\le\mathcal{L}_{\beta}^{2s}(\Psi_{2s}^{k})-c_1\Vert \Psi_{2s}^{k+1}-\Psi_{2s}^{k}\Vert^2$ with some $c_1>0$.
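
A standard consequence of a sufficient-decrease inequality of this form can be sketched as follows. The sketch assumes, beyond what this summary states, that $\mathcal{L}_{\beta}^{2s}$ is bounded below along the iterates by some constant $\underline{\mathcal{L}}$ (a symbol introduced here only for illustration). Rearranging the inequality and summing over $k=0,\dots,K-1$ telescopes the right-hand side:

$$
c_1\sum_{k=0}^{K-1}\Vert \Psi_{2s}^{k+1}-\Psi_{2s}^{k}\Vert^2
\le \mathcal{L}_{\beta}^{2s}(\Psi_{2s}^{0})-\mathcal{L}_{\beta}^{2s}(\Psi_{2s}^{K})
\le \mathcal{L}_{\beta}^{2s}(\Psi_{2s}^{0})-\underline{\mathcal{L}}.
$$

Letting $K\to\infty$, the squared successive differences are summable under this assumption, so $\Vert \Psi_{2s}^{k+1}-\Psi_{2s}^{k}\Vert\to 0$. This bound alone does not show that the whole sequence converges; that step typically relies on a KL-type argument.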

Figures (7)

Figure 1: Relationships between the training problem, its relaxations, and the RADMMs.
Figure 2: Relationship between $\mathcal{L}_{\beta}^{3s}(\Psi_{3s})$ and $\mathcal{L}_R^{3s}(\widetilde{\Psi}_{3s})$.
Figure 3: PRU employed in parallel RADMMs.
Figure 4: Pipelined update pattern of the parallel two-splitting RADMMs.
Figure 5: MSE training (left) and test (right) losses for the 40-layer ReLU (top) and sigmoid (bottom) residual networks on the Wine Quality dataset.
...and 2 more figures

Theorems & Definitions (17)

Definition 1: Bertsekas2015
Definition 2: Mordukhovich2006, Rockafellar1998
Definition 3: Mordukhovich2006, Rockafellar1998
Definition 4: Bertsekas2015
Definition 5: Attouch2010, Li2018
Definition 6: Krantz2002
Proposition 1: Proof in the appendix
Proposition 2: Proof in the appendix
Theorem 1: Convergence (proof in the appendix)
Theorem 2: Convergence Rate (proof in the appendix)
...and 7 more
