Towards Exact Gradient-based Training on Analog In-memory Computing

Zhaoxian Wu, Tayfun Gokmen, Malte J. Rasch, Tianyi Chen

TL;DR

This paper addresses the challenge of gradient-based training on analog in-memory computing (AIMC) hardware, where asymmetric updates and device non-idealities induce a nonzero asymptotic error for standard SGD. It develops a physics-grounded discrete-time model for Analog SGD and proves a lower bound on the irreducible error. It then analyzes Tiki-Taka, an existing heuristic auxiliary-gradient scheme, and shows that it eliminates the asymptotic error and converges to a critical point at a rate matching the stochastic lower bound up to constants. Numerical simulations on synthetic tasks and real datasets corroborate the theory, showing that Tiki-Taka can match or exceed digital SGD performance under realistic AIMC non-idealities. The results provide a principled pathway to exact gradient-based training on AIMC, with practical significance for energy-efficient, scalable AI training on analog hardware.

Abstract

Given the high economic and environmental costs of using large vision or language models, analog in-memory accelerators present a promising solution for energy-efficient AI. While inference on analog accelerators has been studied recently, the training perspective is underexplored. Recent studies have shown that the "workhorse" of digital AI training, the stochastic gradient descent (SGD) algorithm, converges inexactly when applied to model training on non-ideal devices. This paper puts forth a theoretical foundation for gradient-based training on analog devices. We begin by characterizing the non-convergence issue of SGD, which is caused by the asymmetric updates on the analog devices. We then provide a lower bound on the asymptotic error to show that this is a fundamental performance limit of SGD-based analog training rather than an artifact of our analysis. To address this issue, we study a heuristic analog algorithm called Tiki-Taka that has recently exhibited superior empirical performance compared to SGD, and we rigorously show that it converges exactly to a critical point and hence eliminates the asymptotic error. The simulations verify the correctness of the analyses.
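
The effect described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm or experiment. It assumes a scalar quadratic loss with Gaussian gradient noise, an asymmetric linear device with symmetric point 0 whose response is modeled as `w + dw - |dw| * w / tau`, and a simplified two-array update in the spirit of Tiki-Taka, in which an auxiliary array `p` accumulates gradients and is then transferred to the weight `w`. The names `ald_step` and `simulate`, all hyperparameters, and the exact form of the two-array update are assumptions made for illustration; the recursions analyzed in the paper are not reproduced in this summary.

```python
import numpy as np


def ald_step(w, dw, tau):
    """Assumed asymmetric-linear-device response with symmetric point 0.

    Increments pushing the weight away from 0 are shrunk and increments
    pushing it towards 0 are amplified (illustrative model, not a library API).
    """
    return w + dw - np.abs(dw) * w / tau


def simulate(method, steps=20000, runs=64, alpha=0.01, beta=0.01,
             tau=1.0, sigma=1.0, w_star=0.5, seed=0):
    """Return |time-averaged iterate - w_star| for a noisy scalar quadratic.

    method: "digital" (plain SGD), "analog" (SGD on the assumed device) or
    "two_array" (hypothetical Tiki-Taka-style auxiliary-array update).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(runs)  # weights of `runs` independent trajectories
    p = np.zeros(runs)  # auxiliary array, used only by "two_array"
    tail = []
    for k in range(steps):
        # Stochastic gradient of 0.5 * (w - w_star)^2 with additive noise.
        grad = (w - w_star) + sigma * rng.standard_normal(runs)
        if method == "digital":
            w = w - alpha * grad
        elif method == "analog":
            w = ald_step(w, -alpha * grad, tau)
        elif method == "two_array":
            p = ald_step(p, -alpha * grad, tau)  # accumulate on the device
            w = ald_step(w, beta * p, tau)       # transfer to the weights
        else:
            raise ValueError(f"unknown method: {method}")
        if k >= steps // 2:
            tail.append(w.mean())  # average over runs, second half only
    return abs(float(np.mean(tail)) - w_star)
```

Calling `simulate("digital")`, `simulate("analog")` and `simulate("two_array")` returns the distance between the time-averaged iterate and the minimizer, so the three methods can be compared directly. In this toy model the single-array analog update is pulled towards the symmetric point by the gradient noise, while the two-array variant reduces that pull without claiming the exact convergence the paper proves for Tiki-Taka.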

Paper Structure (42 sections, 144 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Digital/Analog SGD under different learning rates.
  • Figure 2: The change in weight with the number of pulses. Positive and negative pulses are sent consecutively in the left and right halves, respectively. Beginning from $w$, the weight after applying update $\Delta w$ to it is $w^+$ if $\Delta w \ge 0$ or $w^-$ if $\Delta w < 0$. The response factors $q_+(w)$ and $q_-(w)$ are approximately the slope of the curve at $w$. (Left) Ideal device. $q_+(w)=q_-(w)\equiv 1$. Every point is a symmetric point. (Right) Asymmetric Linear Device (ALD). $q_{+}(w) = 1 - {(w-w^\diamond)}/{\tau}, q_{-}(w) = 1 + {(w-w^\diamond)}/{\tau}$. The symmetric point $w^\diamond$ satisfies $q_+(w^\diamond)=q_-(w^\diamond)$.
  • Figure 3: The convergence of digital SGD dynamics \ref{recursion:SGD}, analog dynamics \ref{recursion:analog-GD} (proposed) and Analog SGD implemented by AIHWKIT (real behavior) under different $\tau$.
  • Figure 4: (Left) The convergence of Analog SGD under different $\tau$. Increasing $\tau$ decreases the asymptotic error; when $\tau$ is sufficiently large, Analog SGD performs similarly to digital SGD. (Middle) The convergence of Analog SGD on noisy devices under different $\sigma^2$. (Right) Analog SGD runs initialized at different points converge to the same error.
  • Figure 5: The test accuracy curves and tables for the model training. "D SGD", "A SGD", and "TT" represent Digital SGD, Analog SGD and Tiki-Taka, respectively; (Left) FCN. (Right) CNN.
  • ...and 3 more figures
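
The pulse response described in the Figure 2 caption can be sketched numerically. The snippet below is a minimal illustration that assumes each pulse changes the weight by a fixed increment `dw_min` scaled by the response factor at the current weight; the caption gives only the response factors, so that update rule, the names `response_factors` and `apply_pulses`, and the parameter values are assumptions. Passing `tau=None` selects the ideal device with $q_+(w)=q_-(w)\equiv 1$.

```python
import numpy as np


def response_factors(w, tau, w_sym=0.0):
    """ALD response factors q_+(w) and q_-(w) from the Figure 2 caption."""
    q_plus = 1.0 - (w - w_sym) / tau
    q_minus = 1.0 + (w - w_sym) / tau
    return q_plus, q_minus


def apply_pulses(w0, n_up, n_down, dw_min, tau=None, w_sym=0.0):
    """Send n_up positive pulses, then n_down negative pulses.

    Assumed rule: each pulse moves the weight by dw_min times the response
    factor at the current weight. tau=None models the ideal device.
    Returns the weight trace, including the starting value.
    """
    w = w0
    trace = [w]
    for k in range(n_up + n_down):
        positive = k < n_up
        if tau is None:
            q = 1.0
        else:
            q_plus, q_minus = response_factors(w, tau, w_sym)
            q = q_plus if positive else q_minus
        w = w + (dw_min if positive else -dw_min) * q
        trace.append(w)
    return np.array(trace)


ideal = apply_pulses(0.0, 100, 100, 0.01)         # straight up, straight down
ald = apply_pulses(0.0, 100, 100, 0.01, tau=1.0)  # saturating, asymmetric
```

Under this assumed rule the ideal trace returns to its starting weight after equal numbers of up and down pulses, whereas the ALD trace flattens as it approaches $w^\diamond + \tau$ and does not return to its starting weight.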

Theorems & Definitions (17)

  • proof : Proof of Theorem \ref{theorem:bounded-saturation-variable}
  • proof : Proof of Theorem \ref{theorem:bounded-saturation-variable-scalar}
  • proof
  • proof : Proof of Theorem \ref{theorem:bounded-saturation-variable-analog-SGD-scalar}
  • proof : Proof of Lemma \ref{lemma:saturation-linear}
  • proof : Proof of Theorem \ref{theorem:bounded-variable}
  • proof : Proof of Theorem \ref{theorem:AGD-convergence-noncvx-linear}
  • proof : Proof of Theorem \ref{theorem:ASGD-convergence-noncvx-linear}
  • proof : Proof of Lemma \ref{lemma:ASGD-convergence-noncvx-linear-lower-bounded-variable}
  • proof : Proof of Lemma \ref{lemma:ASGD-convergence-noncvx-linear-lower-xi-Lip}
  • ...and 7 more