Towards Exact Gradient-based Training on Analog In-memory Computing
Zhaoxian Wu, Tayfun Gokmen, Malte J. Rasch, Tianyi Chen
TL;DR
This paper addresses gradient-based training on analog in-memory computing (AIMC) hardware, where asymmetric updates and other device non-idealities leave standard SGD with a nonzero asymptotic error. It develops a physics-grounded discrete-time model of Analog SGD and proves a lower bound on that irreducible error. It then analyzes Tiki-Taka, a previously proposed heuristic auxiliary-gradient scheme, and shows that it eliminates the asymptotic error and converges to a critical point at a rate matching the stochastic lower bound up to constants. Numerical simulations on synthetic tasks and real datasets corroborate the theory, showing that Tiki-Taka can match or exceed digital SGD performance under realistic AIMC non-idealities. The results provide a principled path to exact gradient-based training on AIMC hardware, with practical significance for energy-efficient, scalable AI training.
Abstract
Given the high economic and environmental costs of using large vision or language models, analog in-memory accelerators present a promising solution for energy-efficient AI. While inference on analog accelerators has been studied recently, the training perspective is underexplored. Recent studies have shown that the "workhorse" of digital AI training, the stochastic gradient descent (SGD) algorithm, converges inexactly when applied to model training on non-ideal devices. This paper puts forth a theoretical foundation for gradient-based training on analog devices. We begin by characterizing the non-convergence of SGD, which is caused by the asymmetric updates on analog devices. We then provide a lower bound on the asymptotic error to show that this is a fundamental performance limit of SGD-based analog training rather than an artifact of our analysis. To address this issue, we study a heuristic analog algorithm called Tiki-Taka, which has recently exhibited superior empirical performance compared to SGD, and rigorously show that it converges exactly to a critical point and hence eliminates the asymptotic error. The simulations verify the correctness of the analyses.
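To make the asymmetric-update effect concrete, the following is a minimal toy sketch, not the paper's experimental setup. It assumes a scalar quadratic objective f(w) = 0.5 (w - w_star)^2 with Gaussian gradient noise. It also assumes a simplified device model in which each update g picks up an extra term -(alpha / tau) * |g| * w, so that increments shrink as the weight approaches the saturation level tau. The Tiki-Taka variant assumes an auxiliary analog variable p that accumulates gradients under the same device model and is then transferred to w. All function names, parameter values, and the exact update rules here are illustrative assumptions and are not taken from the text above.

```python
import random


def noisy_grad(w, w_star, sigma, rng):
    # Stochastic gradient of f(w) = 0.5 * (w - w_star)**2 with additive Gaussian noise.
    return (w - w_star) + sigma * rng.gauss(0.0, 1.0)


def analog_sgd(w_star=0.5, tau=1.0, sigma=1.0, alpha=0.01,
               steps=60000, tail=20000, seed=0):
    """Assumed Analog SGD model; returns |tail-averaged w - w_star|."""
    rng = random.Random(seed)
    w, acc = 0.0, 0.0
    for k in range(steps):
        g = noisy_grad(w, w_star, sigma, rng)
        # Ideal SGD step plus the assumed asymmetric device term.
        w = w - alpha * g - (alpha / tau) * abs(g) * w
        if k >= steps - tail:
            acc += w
    return abs(acc / tail - w_star)


def tiki_taka(w_star=0.5, tau=1.0, sigma=1.0, alpha=0.01, beta=0.01,
              steps=60000, tail=20000, seed=0):
    """Assumed Tiki-Taka model; returns |tail-averaged w - w_star|."""
    rng = random.Random(seed)
    w, p, acc = 0.0, 0.0, 0.0
    for k in range(steps):
        g = noisy_grad(w, w_star, sigma, rng)
        # The auxiliary variable p accumulates gradients on an analog device.
        p = p + beta * g - (beta / tau) * abs(g) * p
        # The main weight w is updated with p instead of the raw gradient.
        w = w - alpha * p - (alpha / tau) * abs(p) * w
        if k >= steps - tail:
            acc += w
    return abs(acc / tail - w_star)


if __name__ == "__main__":
    print("Analog SGD tail error:", analog_sgd())
    print("Tiki-Taka tail error: ", tiki_taka())
```

In this toy model the gradient noise keeps |g| away from zero even at the optimum, so the extra term keeps pulling the Analog SGD iterate toward zero and leaves a persistent bias. The auxiliary variable p averages out much of that noise before it reaches w, so the residual bias is much smaller. With constant step sizes the sketch only shows a reduction in error, not exact convergence.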
