ABI: A tightly integrated, unified, sparsity-aware, reconfigurable, compute near-register file/cache GPU architecture with light-weight softmax for deep learning, linear algebra, and Ising compute

Siddhartha Raman Sundara Raman, Jaydeep P. Kulkarni

TL;DR

The paper proposes ABI, a tightly integrated, sparsity-aware, reconfigurable near-memory GPU architecture that targets the data-movement bottleneck between memory and the ALUs across CNN, GCN, LP, Ising, and LLM workloads. It combines near-register-file and near-memory logic with a 5-stage Reconfigurable Compute Engine, a lightweight near-memory softmax circuit, adaptive sparsity circuitry, and dynamically updatable resolution up to INT16. The design achieves 6–16x speedups and 6–13x energy savings over a baseline MIAOW GPU; the sparsity-aware and softmax circuits provide about 1.5x and 1.6x energy savings, respectively, and the design reaches about 370 GOPS/W at 250 MHz. Test-chip measurements of the unified architecture across these workloads support the reported efficiency and performance gains, suggesting practical relevance for near-memory computing in future GPUs and accelerators.
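
To put the headline efficiency figure in more familiar units, the short sketch below converts GOPS/W into energy per operation and a clock frequency into a cycle time. It is only a unit conversion of the numbers quoted above; the function names are hypothetical and nothing here comes from the paper's own tooling.

```python
def picojoules_per_op(gops_per_watt: float) -> float:
    """Convert an efficiency in GOPS/W to energy per operation in pJ.

    1 GOPS/W equals 1e9 operations per joule, so the energy per
    operation is 1 / (gops_per_watt * 1e9) joules, i.e. 1e12 times
    that value in picojoules.
    """
    return 1e12 / (gops_per_watt * 1e9)


def cycle_time_ns(freq_mhz: float) -> float:
    """Convert a clock frequency in MHz to a cycle time in ns."""
    return 1e3 / freq_mhz


if __name__ == "__main__":
    # Figures quoted in the summary: about 370 GOPS/W at 250 MHz.
    print(f"{picojoules_per_op(370):.2f} pJ per operation")
    print(f"{cycle_time_ns(250):.1f} ns per cycle")
```

Under this conversion, about 370 GOPS/W corresponds to roughly 2.7 pJ per operation, and 250 MHz to a 4 ns cycle.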

Abstract

We present a tightly integrated and unified near-memory GPU architecture that delivers 6 to 16 times speedup and 6 to 13 times energy savings across Convolutional Neural Networks, Graph Convolutional Networks, Linear Programming, Large Language Models, and Ising workloads compared to MIAOW GPU. The design includes a custom sparsity-aware near-memory circuit providing about 1.5 times energy savings, and a lightweight softmax circuit providing about 1.6 times energy savings. The architecture supports reconfigurable compute up to INT16 with dynamic resolution updates and scales efficiently across problem sizes. ABI-enabled MI300 and Blackwell systems achieve about 4.5 times speedup over baseline MI300 and Blackwell.

Paper Structure (10 sections, 7 figures)

Figures (7)

  • Figure 1: Limitations of existing accelerators (red/first column), proposed design changes to realize ABI (green/second column), resultant energy savings, area, efficiency using ABI (green/third column)
  • Figure 2: ABI-enabled a) tightly integrated GPU including dispatcher, compute unit, L2 cache b) compute unit c) wavefront fetch, pool d) decode, issue e) register file f) load/store units g) Near-memory (NM) / Near-RF (NRF) logic floorplan h) Programmable registers i) Legend
  • Figure 3: a) Reconfigurable Compute Engine (RCE) with b) thresholding, central adder, scaler c) Speedup showing importance of RCE. ABI’s reconfigurability circuit for d) NM/NRF compute e) varied resolution f) dynamic resolution update
  • Figure 4: Approximate compute block diagram, custom circuit, mappings between block diagram and custom circuit for a) light-weight softmax (LWSM) b) Sparsity monitor necessary for low area, high speed near-memory compute
  • Figure 5: a) Die photograph b) Measurement setup c) Area, power breakdown for CNN, LP, GCN, Ising, LLM across RCE, sparsity, TH, CA, S, PR d) Offline program flow e) Programming model f) Benchmarks g) Parameters of ABI
  • ...and 2 more figures