Towards Exact Gradient-based Training on Analog In-memory Computing

Zhaoxian Wu; Tayfun Gokmen; Malte J. Rasch; Tianyi Chen

TL;DR

This paper addresses the challenge of gradient-based training on analog in-memory computing (AIMC) hardware, where asymmetric updates and device non-idealities induce a nonzero asymptotic error for standard SGD. It develops a physics-grounded discrete-time model for Analog SGD and proves a lower bound on the irreducible error. It then analyzes Tiki-Taka, an existing heuristic auxiliary-gradient scheme, and shows that it eliminates the asymptotic error and converges to a critical point at a rate matching the stochastic lower bound up to constants. Numerical simulations on synthetic tasks and real datasets corroborate the theory, showing that Tiki-Taka can match or exceed digital SGD performance under realistic AIMC non-idealities. The results provide a principled pathway to exact gradient-based training on AIMC, with practical significance for energy-efficient, scalable AI training on analog hardware.

Abstract

Given the high economic and environmental costs of using large vision or language models, analog in-memory accelerators present a promising solution for energy-efficient AI. While inference on analog accelerators has been studied recently, the training perspective is underexplored. Recent studies have shown that the "workhorse" of digital AI training, the stochastic gradient descent (SGD) algorithm, converges inexactly when applied to model training on non-ideal devices. This paper puts forth a theoretical foundation for gradient-based training on analog devices. We begin by characterizing the non-convergence issue of SGD, which is caused by the asymmetric updates on the analog devices. We then provide a lower bound on the asymptotic error to show that it is a fundamental performance limit of SGD-based analog training rather than an artifact of our analysis. To address this issue, we study a heuristic analog algorithm called Tiki-Taka that has recently exhibited superior empirical performance compared to SGD, and we rigorously show that it converges exactly to a critical point and hence eliminates the asymptotic error. The simulations verify the correctness of the analyses.
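The inexact convergence described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: it assumes a scalar quadratic objective, Gaussian gradient noise, and the asymmetric linear device response given in the Figure 2 caption, with the symmetric point at 0. The function name `run_sgd` and all parameter values are hypothetical choices made for illustration.

```python
import random

def run_sgd(analog, w_star=0.5, tau=1.0, lr=0.01, sigma=1.0,
            steps=20000, seed=0):
    """Run noisy SGD on f(w) = 0.5 * (w - w_star)**2 and return the
    average of the iterates over the second half of training.

    If `analog` is True, each update is scaled by an asymmetric linear
    response factor (symmetric point assumed at 0):
      q_plus  = 1 - w / tau  for non-negative updates,
      q_minus = 1 + w / tau  for negative updates.
    """
    rng = random.Random(seed)
    w = 0.0
    tail = []
    for k in range(steps):
        grad = (w - w_star) + rng.gauss(0.0, sigma)  # stochastic gradient
        delta = -lr * grad                           # intended update
        if analog:
            q = 1.0 - w / tau if delta >= 0 else 1.0 + w / tau
            delta *= q                               # device-scaled update
        w += delta
        if k >= steps // 2:
            tail.append(w)
    return sum(tail) / len(tail)

digital_mean = run_sgd(analog=False)
analog_mean = run_sgd(analog=True)
print("digital error:", abs(digital_mean - 0.5))
print("analog error: ", abs(analog_mean - 0.5))
```

In this toy setting the digital iterates average out near the minimizer, while the analog iterates settle at a point pulled toward the symmetric point, because the asymmetric scaling adds a drift proportional to the gradient magnitude that does not vanish under gradient noise.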

Paper Structure (42 sections, 144 equations, 8 figures, 2 tables)

Introduction
Main results
Prior art
The Physics of Analog Training
Revisit SGD theory and its failure in modeling analog training
Training dynamic on analog devices
Saturation, fast reset, and bounded weight
Performance Limits of Analog Stochastic Gradient Descent
Eliminating Asymptotic Error of Analog Training: Tiki-Taka
Numerical Simulations
Verification of the analog training dynamic
Ablation study on the asymptotic training error
Analog training performance on real dataset
Conclusions and Limitations
Literature Review
...and 27 more sections

Figures (8)

Figure 1: Digital/Analog SGD under different learning rates.
Figure 2: The weight's change with the number of pulses. Positive and negative pulses are sent continuously on the left and right half, respectively. Beginning from $w$, the weight after applying update $\Delta w$ to it is $w^+$ or $w^-$ if $\Delta w \ge 0$ or $\Delta w < 0$, respectively. The response factors $q_+(w)$ and $q_-(w)$ are approximately the slope of the curve at $w$. (Left) Ideal device. $q_+(w)=q_-(w)\equiv 1$. Every point is a symmetric point. (Right) Asymmetric Linear Device (ALD). $q_{+}(w) = 1 - (w-w^\diamond)/\tau$, $q_{-}(w) = 1 + (w-w^\diamond)/\tau$. The symmetric point $w^\diamond$ satisfies $q_+(w^\diamond)=q_-(w^\diamond)$.
Figure 3: The convergence of the digital SGD dynamic, the analog dynamic (proposed), and Analog SGD implemented by AIHWKIT (real behavior) under different $\tau$.
Figure 4: (Left) The convergence of Analog SGD under different $\tau$. Increasing $\tau$ leads to a decrease in asymptotic error. When $\tau$ is sufficiently large, Analog SGD tends to have a similar performance to digital SGD. (Middle) The convergence of Analog SGD on noisy devices under different $\sigma^2$. (Right) Analog SGD runs initialized at different points converge to the same error.
Figure 5: The test accuracy curves and tables for the model training. "D SGD", "A SGD", and "TT" represent Digital SGD, Analog SGD and Tiki-Taka, respectively; (Left) FCN. (Right) CNN.
...and 3 more figures
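The response factors in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, assuming the effective weight change is the intended update multiplied by $q_+(w)$ or $q_-(w)$ according to its sign; the function names `ald_response` and `ald_update` and the default values are hypothetical.

```python
def ald_response(w, w_sym=0.0, tau=1.0):
    """Return (q_plus, q_minus) for the Asymmetric Linear Device.

    q_plus(w)  = 1 - (w - w_sym) / tau
    q_minus(w) = 1 + (w - w_sym) / tau
    """
    q_plus = 1.0 - (w - w_sym) / tau
    q_minus = 1.0 + (w - w_sym) / tau
    return q_plus, q_minus

def ald_update(w, delta_w, w_sym=0.0, tau=1.0):
    """Apply an intended update delta_w to weight w.

    Assumption: the realized change is delta_w scaled by q_plus when
    delta_w >= 0 and by q_minus when delta_w < 0.
    """
    q_plus, q_minus = ald_response(w, w_sym, tau)
    return w + delta_w * (q_plus if delta_w >= 0 else q_minus)

# At the symmetric point both factors coincide.
print(ald_response(0.0))
# Away from it, equal-sized up and down updates move the weight unequally.
print(ald_update(0.5, 0.1), ald_update(0.5, -0.1))
```

Under this assumption both cases reduce to $w + \Delta w - |\Delta w|\,(w - w^\diamond)/\tau$, so every update carries a pull toward the symmetric point, and a very large $\tau$ recovers the ideal device.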

Theorems & Definitions (17)

proof : Proof of Theorem \ref{theorem:bounded-saturation-variable}
proof : Proof of Theorem \ref{theorem:bounded-saturation-variable-scalar}
proof
proof : Proof of Theorem \ref{theorem:bounded-saturation-variable-analog-SGD-scalar}
proof : Proof of Lemma \ref{lemma:saturation-linear}
proof : Proof of Theorem \ref{theorem:bounded-variable}
proof : Proof of Theorem \ref{theorem:AGD-convergence-noncvx-linear}
proof : Proof of Theorem \ref{theorem:ASGD-convergence-noncvx-linear}
proof : Proof of Lemma \ref{lemma:ASGD-convergence-noncvx-linear-lower-bounded-variable}
proof : Proof of Lemma \ref{lemma:ASGD-convergence-noncvx-linear-lower-xi-Lip}
...and 7 more
