Efficient transformer adaptation for analog in-memory computing via low-rank adapters

Chen Li, Elena Ferro, Corey Lammie, Manuel Le Gallo, Irem Boybat, Bipin Rajendran

Abstract

Analog In-Memory Computing (AIMC) offers a promising solution to the von Neumann bottleneck. However, deploying transformer models on AIMC remains challenging due to their inherent need for flexibility and adaptability across diverse tasks. For the benefits of AIMC to be fully realized, weights of static vector-matrix multiplications must be mapped and programmed to analog devices in a weight-stationary manner. This poses two challenges for adapting a base network to hardware and downstream tasks: (i) conventional analog hardware-aware (AHWA) training requires retraining the entire model, and (ii) reprogramming analog devices is both time- and energy-intensive. To address these issues, we propose Analog Hardware-Aware Low-Rank Adaptation (AHWA-LoRA) training, a novel approach for efficiently adapting transformers to AIMC hardware. AHWA-LoRA training keeps the analog weights fixed as meta-weights and introduces lightweight external LoRA modules for both hardware and task adaptation. We validate AHWA-LoRA training on SQuAD v1.1 and the GLUE benchmark, demonstrate its scalability to larger models, and show its effectiveness in instruction tuning and reinforcement learning. We further evaluate a practical deployment scenario that balances AIMC tile latency with digital LoRA processing using optimized pipeline strategies, with RISC-V-based programmable multi-core accelerators. This hybrid architecture achieves efficient transformer inference with only a 4% per-layer overhead compared to a fully AIMC implementation.

Paper Structure

This paper contains 25 sections, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Application of AHWA-LoRA training to the multi-head attention block in the standard transformer architecture. a Within each transformer block, the fixed weights of the linear (dense) layers ($W \in \mathbb{R}^{m \times n}$) are mapped to AIMC tiles. The LoRA weight matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ can be trained to adapt the effective weights for different downstream tasks without altering the meta-weights $W$. b The proposed high-level architecture, where each AIMC tile is paired with a PMCA. The latency of the AIMC tiles and the PMCA is balanced. Computation of $XAB$ is performed within the PMCA, alongside the required addition operation. c Conventional AHWA training and deployment methods require retraining the meta-weights of the pre-trained model. d AHWA-LoRA training preserves the meta-weights from the pre-trained model and deploys them directly to the AIMC hardware. Only the LoRA weight matrices are trained and executed by the DPU.
  • Figure 2: Resource optimization study for MobileBERT and SQuAD v1.1 at different drift times. a Pareto front of the F1 score and total LoRA adapter memory for different rank values. A rank of 8 (highlighted using a vertical dashed line) provides a good tradeoff between these quantities. b The total number of LoRA parameters when adapters are applied to different layers of the multi-head attention block.
  • Figure 3: Dynamic adaptation and scalability studies at different drift times. a Different configurations for AHWA-LoRA training evaluated on MobileBERT and SQuAD v1.1. *LoRA weight reloading. b Scalability analysis using BERT-Base and BERT-Large.
  • Figure 4: Performance analysis and latency balancing of the proposed architecture with AIMC tiles and coupled PMCA. a The latency of two different MobileBERT layers with varying AIMC tile integration times. b The PMCA’s TCDM requirement as a function of the number of parallel tokens. c The total latency for all the different layers of MobileBERT with an optimized AIMC-PMCA pipeline balancing the latency of the AIMC tiles and PMCA. The latency without any LoRA adapters (AIMC) is also reported.
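
The split described in the Figure 1 caption, where the fixed meta-weights $W$ sit on the AIMC tile and the product $XAB$ plus the final addition run digitally, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the layer sizes, and the zero initialization of $B$ (a common LoRA convention) are assumptions chosen for illustration, and analog non-idealities such as noise and drift are ignored.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Gaussian random matrix (illustrative initialization only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def lora_forward(x, w_meta, a, b):
    """Compute Y = X W + (X A) B.

    X W stands in for the AIMC tile output (W stays fixed), while
    (X A) B and the addition stand in for the digital side.
    """
    analog_out = matmul(x, w_meta)          # fixed meta-weights, never retrained
    digital_out = matmul(matmul(x, a), b)   # low-rank correction, trainable
    return add(analog_out, digital_out)


def lora_param_count(m, n, r):
    """Number of trainable parameters in A (m x r) and B (r x n)."""
    return r * (m + n)


# Hypothetical sizes, not taken from the paper.
rng = random.Random(0)
tokens, m, n, r = 4, 64, 32, 8

x = rand_matrix(tokens, m, rng)
w = rand_matrix(m, n, rng)
a = rand_matrix(m, r, rng)
b = [[0.0] * n for _ in range(r)]  # assumption: B starts at zero

y = lora_forward(x, w, a, b)
print(len(y), len(y[0]))                   # output shape: tokens x n
print(lora_param_count(m, n, r), m * n)    # adapter parameters vs. full matrix
```

With $B$ at zero the adapter contributes nothing, so the output equals $XW$ and adaptation starts from the unmodified pre-trained layer. Computing $(XA)B$ rather than $X(AB)$ keeps the intermediate result at width $r$, which is what makes the digital side lightweight when $r \ll \min(m, n)$.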