Accelerating AI Performance using Anderson Extrapolation on GPUs

Saleem Abdul Fattah Ahmed Al Dajani, David E. Keyes

TL;DR

This work targets the bottleneck of convergence speed in AI workloads by applying Anderson extrapolation to fixed-point iterations, with a focus on deep equilibrium models (DEQs) in GPU environments. By using a windowed history of iterates and a residual-minimizing weighting scheme, the approach accelerates forward passes and training without relying on Hessian inversions, yielding faster convergence and more stable accuracy plateaus. Empirical results on CIFAR-10 show substantial speedups (2×–8.6×) and reduced computation per solution, while achieving higher training and testing accuracy plateaus than standard forward iterations. The method is matrix-free and well-suited to HPC-scale architectures, offering potential energy and performance benefits for large-scale AI workloads, with future directions including stochastic variants and broader hardware deployments.

Abstract

We present a novel approach for accelerating AI performance with Anderson extrapolation, a vector-to-vector mapping technique that operates on a window of past iterates. By identifying the crossover point (Fig. 1) at which a mixing penalty is incurred, the method reduces the number of iterations to convergence: it performs fewer iterations, each more compute-intensive but generally cacheable, and thereby balances speed with accuracy and memory usage with algorithmic stability. We demonstrate significant improvements in both training and inference, motivated by extending scalability and efficiency to high-performance computing (HPC).
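
The windowed extrapolation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses the standard unregularized least-squares form of Anderson extrapolation with full mixing, and the test map `tanh(W z + b)`, the window size `m`, the tolerance, and the constant `lam` in the stopping test are all illustrative assumptions. Plain forward iteration is included for comparison.

```python
import numpy as np

def rel_residual(fz, z, lam=1e-5):
    # Relative fixed-point residual; lam is an assumed small constant that guards the denominator.
    return np.linalg.norm(fz - z) / (np.linalg.norm(fz) + lam)

def forward_iteration(f, z0, tol=1e-6, max_iter=500):
    """Plain fixed-point iteration z <- f(z). Returns (solution, function evaluations)."""
    z = z0.copy()
    for k in range(1, max_iter + 1):
        fz = f(z)
        if rel_residual(fz, z) < tol:
            return fz, k
        z = fz
    return z, max_iter

def anderson(f, z0, m=5, tol=1e-6, max_iter=500):
    """Anderson extrapolation over a window of the m most recent iterates."""
    z = z0.copy()
    Z, F = [], []                          # histories of z_i and f(z_i)
    for k in range(1, max_iter + 1):
        fz = f(z)
        if rel_residual(fz, z) < tol:
            return fz, k
        Z, F = (Z + [z])[-m:], (F + [fz])[-m:]
        if len(Z) == 1:                    # not enough history yet: plain step
            z = fz
            continue
        Zm, Fm = np.stack(Z, axis=1), np.stack(F, axis=1)
        G = Fm - Zm                        # residuals g_i = f(z_i) - z_i as columns
        dG = G[:, 1:] - G[:, :-1]          # successive residual differences
        dF = Fm[:, 1:] - Fm[:, :-1]        # successive differences of f(z_i)
        # Weights that minimize the norm of the extrapolated residual.
        gamma, *_ = np.linalg.lstsq(dG, G[:, -1], rcond=None)
        z = Fm[:, -1] - dF @ gamma         # extrapolated next iterate
    return z, max_iter

# Hypothetical contractive test map: the spectral norm of W is scaled to 0.9.
rng = np.random.default_rng(0)
n = 50
W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)
b = rng.standard_normal(n)
f = lambda z: np.tanh(W @ z + b)

z_fwd, k_fwd = forward_iteration(f, np.zeros(n))
z_and, k_and = anderson(f, np.zeros(n))
print("forward iteration evaluations:", k_fwd)
print("Anderson evaluations:", k_and)
```

Each Anderson step adds a small least-squares solve on top of one evaluation of `f`, which is the per-iteration overhead that the mixing penalty refers to; the method pays off when the saved evaluations outweigh that cost.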

Paper Structure

This paper contains 13 sections, 7 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Crossover and mixing penalty plotted against time. The relative residual is $\frac{\|f(z^k,x) - z^k\|_2}{\|f(z^k,x)\|_2 + \lambda}$ [kolter2020].
  • Figure 2: AI carbon footprint. AI is projected to consume >2% of global electricity demand [andrae2015global, de2023growing, patterson2021carbon, jones2018stop], amounting to >10% of global electricity demand for data centers and infrastructure.
  • Figure 3: Mathematical formulation and vector representation. Adapted from Y. He & H. De Sterck. "Linear Asymptotic Convergence Analysis of Anderson Acceleration, with Krylov Formulation in the Linear Case" Copper Mountain Conference (2022), ICERM Workshop (2023). Available at: https://www.bilibili.com/video/BV1Wa411i77y/ and https://icerm.brown.edu/video_archive/?play=3320
  • Figure 4: Deep equilibrium neural network model architecture (Source: NeurIPS Tutorial, 2020 [kolter2020]). $f(z,x) = \mathrm{norm}(\mathrm{ReLU}(z + \mathrm{norm}(x + W_2*(\mathrm{norm}(\mathrm{ReLU}(W_1 * z))))))$. Here "norm" is a group norm, a statistical normalization [wu2018group].
  • Figure 5: Evaluating the CIFAR-10 dataset with a deep equilibrium model. Anderson is 1.2× more accurate at stable convergence above the mixing penalty.
  • ...and 2 more figures
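
The cell in the Figure 4 caption and the relative residual in the Figure 1 caption can be written out as a short NumPy sketch. Several choices here are assumptions made only for illustration: dense matrices stand in for the `*` operation applied by $W_1$ and $W_2$, the group norm has no learned scale or shift, and the group count, layer sizes, weight scale, and the value of `lam` (the $\lambda$ in the residual) are hypothetical.

```python
import numpy as np

def group_norm(v, num_groups=4, eps=1e-5):
    # Normalize each group of features to zero mean and (approximately) unit variance.
    g = v.reshape(num_groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(v.shape)

def relu(v):
    return np.maximum(v, 0.0)

def deq_cell(z, x, W1, W2):
    # f(z, x) = norm(ReLU(z + norm(x + W2 * norm(ReLU(W1 * z)))))
    inner = group_norm(relu(W1 @ z))
    return group_norm(relu(z + group_norm(x + W2 @ inner)))

def relative_residual(z, x, W1, W2, lam=1e-5):
    # ||f(z, x) - z||_2 / (||f(z, x)||_2 + lam)
    fz = deq_cell(z, x, W1, W2)
    return np.linalg.norm(fz - z) / (np.linalg.norm(fz) + lam)

# Hypothetical sizes and small random weights, chosen only to exercise the functions.
rng = np.random.default_rng(0)
n, hidden = 16, 32
W1 = 0.01 * rng.standard_normal((hidden, n))
W2 = 0.01 * rng.standard_normal((n, hidden))
x = rng.standard_normal(n)

z = np.zeros(n)
for k in range(10):
    # Track the residual along plain forward iteration z <- f(z, x).
    print(k, relative_residual(z, x, W1, W2))
    z = deq_cell(z, x, W1, W2)
```

The loop is the plain forward iteration that Anderson extrapolation is compared against; whether and how quickly the residual falls depends on the weights, so no particular output is claimed here.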