On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training

Zhaoxian Wu, Quan Xiao, Tayfun Gokmen, Hsinyu Tsai, Kaoutar El Maghraoui, Tianyi Chen

TL;DR

This paper analyzes the convergence of stochastic gradient descent for multi-layer DNNs trained on analog in-memory computing (AIMC) hardware with asynchronous pipeline parallelism (Analog-SGD-AP). It models both stale forward/backward signals and analog weight-update bias, and proves convergence with iteration complexity $O\left(\varepsilon^{-2}+\varepsilon^{-1}\right)$, which matches digital SGD and synchronous-pipeline Analog SGD except for the non-dominant $O(\varepsilon^{-1})$ term. The work shows that the asynchronous pipeline achieves the maximal computation density of 1 and yields favorable wall-clock speedups over synchronous variants in AIMC simulator experiments; validation on real hardware is left for future work. Overall, the results provide a theoretical foundation for scalable AIMC training with asynchronous pipelining, which keeps convergence guarantees while gaining throughput.
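The computation density mentioned above can be illustrated with a minimal sketch. It assumes an idealized schedule in which every stage takes one clock cycle per micro-batch, and it uses the standard fill-and-drain count for the synchronous pipeline. The function name `computation_density` and the schedule labels are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def computation_density(num_stages: int, num_microbatches: int, schedule: str) -> Fraction:
    """Fraction of (stage, clock-cycle) slots that do useful work.

    Idealized model (an assumption for illustration): each stage needs
    exactly one clock cycle per micro-batch and communication is free.
    """
    M, B = num_stages, num_microbatches
    if schedule == "no_pipeline":
        # Only one stage is active at a time, so each stage works
        # during 1 out of every M cycles.
        return Fraction(1, M)
    if schedule == "sync":
        # Fill-and-drain: one pass of B micro-batches through M stages
        # takes M + B - 1 cycles, and each stage is busy for B of them.
        return Fraction(B, M + B - 1)
    if schedule == "async":
        # No flush between mini-batches: once the pipeline is full,
        # every stage is busy in every cycle.
        return Fraction(1, 1)
    raise ValueError(f"unknown schedule: {schedule}")


for name in ("no_pipeline", "sync", "async"):
    print(name, computation_density(4, 4, name))
```

With 4 stages and 4 micro-batches, this model gives 1/4 without pipelining, 4/7 for the synchronous pipeline, and 1 for the asynchronous pipeline. The synchronous value approaches 1 only as the number of micro-batches grows relative to the number of stages.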

Abstract

Aiming to accelerate the training of large deep neural networks (DNNs) in an energy-efficient way, analog in-memory computing (AIMC) emerges as a solution with immense potential. An AIMC accelerator keeps model weights in memory without moving them from memory to processors during training, reducing overhead dramatically. Despite its efficiency, scaling up AIMC systems presents significant challenges. Since weight copying is expensive and inaccurate, data parallelism is less efficient on AIMC accelerators. This necessitates the exploration of pipeline parallelism, particularly asynchronous pipeline parallelism, which utilizes all available accelerators during the training process. This paper examines the convergence theory of stochastic gradient descent on AIMC hardware with an asynchronous pipeline (Analog-SGD-AP). Although AIMC accelerators have been explored empirically, the theoretical understanding of how analog hardware imperfections in weight updates affect the training of multi-layer DNN models remains underexplored. Furthermore, asynchronous pipeline parallelism introduces stale weights, so the update signals are no longer valid gradients. To close the gap, this paper investigates the convergence properties of Analog-SGD-AP on multi-layer DNN training. We show that Analog-SGD-AP converges with iteration complexity $O(\varepsilon^{-2}+\varepsilon^{-1})$ despite the aforementioned issues, which matches the complexities of digital SGD and Analog SGD with a synchronous pipeline, except for the non-dominant term $O(\varepsilon^{-1})$. This implies that, by overlapping computation, AIMC training benefits from asynchronous pipelining almost for free compared with the synchronous pipeline.
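A short worked step shows how a complexity of the form $O(\varepsilon^{-2}+\varepsilon^{-1})$ typically arises and why the second term is non-dominant. As an assumption for illustration only (this is not the paper's exact statement), suppose the convergence measure after $K$ iterations is bounded by $C_1/\sqrt{K} + C_2/K$ for some constants $C_1, C_2 > 0$. Then:

```latex
\frac{C_1}{\sqrt{K}} + \frac{C_2}{K} \le \varepsilon
\quad\Longleftarrow\quad
\frac{C_1}{\sqrt{K}} \le \frac{\varepsilon}{2}
\ \text{ and }\
\frac{C_2}{K} \le \frac{\varepsilon}{2}
\quad\Longleftrightarrow\quad
K \ge \max\left\{ \frac{4C_1^{2}}{\varepsilon^{2}},\ \frac{2C_2}{\varepsilon} \right\}
= O\left(\varepsilon^{-2} + \varepsilon^{-1}\right).
```

For small $\varepsilon$, the $\varepsilon^{-2}$ term dominates, so an extra cost that enters only through $C_2$ changes the $O(\varepsilon^{-1})$ term and leaves the leading-order complexity unchanged.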

Paper Structure

This paper contains 20 sections, 107 equations, 4 figures, 2 tables, 2 algorithms.

Figures (4)

  • Figure 1: Illustration of MVM computation in AIMC accelerators. The weight $W^{(m)}$ on layer $m$ is stored in a crossbar tile. The $(i,j)$-th element of $W^{(m)}$ is represented by the conductance of the $(i,j)$-th resistive element. The MVM operation $z^{(m)} = W^{(m)}x^{(m)}$ is performed by applying voltage $[x^{(m)}]_j$ between the $j$-th and $(j+1)$-th rows. By Ohm's law, the current is $I_{ij}=[W^{(m)}]_{ij}[x^{(m)}]_{j}$; and by Kirchhoff's law, the total current on the $i$-th column is $\sum_{j}I_{ij}=\sum_{j}[W^{(m)}]_{ij}[x^{(m)}]_{j}$. Unlike the digital counterpart, no movement of $W^{(m)}$ is required for MVM calculation.
  • Figure 2: Illustration of pipelines with 4 accelerators ($M=4$). Each mini-batch is split into $B=4$ micro-batches. Each color represents one micro-batch, and each row from bottom to top represents stages 0 to 4. Each column corresponds to a single clock cycle in which one micro-batch is processed. The white square indicates the corresponding accelerator is idle. (Top) Analog-SGD-WOP. All weights are updated after each mini-batch is computed, which is not presented in the figure. (Middle) Analog-SGD-SP with micro-batch count $B=5$. The update occurs after all 5 micro-batches are processed. (Bottom) Analog-SGD-AP. The update occurs once the gradient of a single micro-batch is obtained. The grey square indicates that the corresponding accelerator is processing data that is not fully reflected in the figure.
  • Figure 3: Illustration of the dynamics of the asynchronous pipeline. The $W_k^{(m)}$ in the circle implies the update happens in this clock cycle, and the symbols in the squares indicate the input of each accelerator.
  • Figure 4: Training ResNet models on CIFAR10 dataset via vanilla model parallelism without pipeline (wo pipeline), synchronous pipeline (Sync), and asynchronous pipeline (Asyn). (Left) Accuracy-vs-Epoch reflects sample complexity. (Middle) Accuracy-vs-Clock cycle. (Right) Speedup of the proposed pipeline training methods on AIMC accelerators as the number of stages increases compared with the single machine training case (stage=1). We ignore the communication latency since the asynchronous pipeline does not introduce obvious extra latency.
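
The crossbar computation described in the Figure 1 caption can be sketched in a few lines. This is an idealized illustration only: it ignores device noise, conductance limits, and any other hardware non-ideality, and the function name `crossbar_mvm` is hypothetical.

```python
def crossbar_mvm(W, x):
    """Idealized crossbar matrix-vector multiply z = W x.

    W is a list of rows (conductances), x is a list of input voltages.
    Assumption for illustration: perfectly linear devices and no noise.
    """
    z = []
    for row in W:
        # Ohm's law per resistive element: I_ij = W_ij * x_j.
        element_currents = [w_ij * x_j for w_ij, x_j in zip(row, x)]
        # Kirchhoff's current law: the currents on one output line add up.
        z.append(sum(element_currents))
    return z


W = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -1.0]
print(crossbar_mvm(W, x))  # [-1.5, -2.5]
```

The weights stay in place inside `W` for the whole computation; only the inputs `x` and the summed currents `z` move, which is the property the caption contrasts with digital hardware.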

Theorems & Definitions (9)

  • proof: Proof of Lemma \ref{lemma:signal-delay-short} and Lemma \ref{lemma:signal-delay}
  • proof: Proof of Lemma \ref{lemma:gradient-delay}
  • proof
  • proof: Proof of Lemma \ref{lemma:bounded-signal}
  • proof: Proof of Lemma \ref{lemma:signal-stability-short} and Lemma \ref{lemma:signal-stability}
  • proof: Proof of Lemma \ref{lemma:objective-L-smooth-short} and Lemma \ref{lemma:objective-L-smooth}
  • proof: Proof of Theorem \ref{theorem:ASGD-sync-convergence-noncvx-linear}
  • proof: Proof of Theorem \ref{theorem:ASGD-async-convergence-noncvx-linear}
  • proof