ClipFormer: Key-Value Clipping of Transformers on Memristive Crossbars for Write Noise Mitigation

Abhiroop Bhattacharjee, Abhishek Moitra, Priyadarshini Panda

TL;DR

ClipFormer addresses the sensitivity of pre-trained Vision Transformers to write noise in memristive crossbars by introducing an inference-time Key-Value transformation that biases mappings toward lower conductances. Implemented as a two-stage clipping controlled by $\alpha$ and $\beta$, ClipFormer requires no hardware or training changes and is compatible with any memristive crossbar platform. Evaluations with ViT-X on ImageNet-1k show substantial non-ideal-accuracy gains, especially at high write-noise levels, and notable reductions in attention-area and energy for VAT-trained models. The approach provides a practical, plug-in mitigation for crossbar non-idealities, improving robustness while offering hardware-efficiency benefits.

Abstract

Transformers have revolutionized various real-world applications from natural language processing to computer vision. However, the traditional von Neumann computing paradigm faces memory and bandwidth limitations in accelerating transformers owing to their massive model sizes. To this end, In-memory Computing (IMC) crossbars based on Non-volatile Memories (NVMs) have emerged as a promising solution for accelerating transformers, due to their ability to perform highly parallelized Matrix-Vector-Multiplications (MVMs) with high energy efficiency. However, analog MVM operations in crossbars introduce non-idealities, such as stochastic read and write noise, which affect the inference accuracy of the deployed transformers. Specifically, we find pre-trained Vision Transformers (ViTs) to be vulnerable on crossbars due to the impact of write noise on the dynamically generated Key (K) and Value (V) matrices in the attention layers, an effect not accounted for in prior studies. We thus propose ClipFormer, a transformation on the K and V matrices during inference, to boost the non-ideal accuracies of pre-trained ViT models. ClipFormer requires no additional hardware or training overhead and is amenable to transformers deployed on any memristive crossbar platform. Our experiments on the ImageNet-1k dataset using pre-trained DeiT-S transformers, subjected to standard training and variation-aware training, show >10-40% higher non-ideal accuracies in the high write noise regime when ClipFormer is applied.
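
As a rough illustration of the terms used in the abstract, the following Python sketch computes an ideal crossbar MVM (each column current is the sum of input voltage times conductance along that column) and then repeats it with perturbed conductances. The multiplicative Gaussian noise model, the function names `crossbar_mvm` and `add_write_noise`, and the chosen `gamma` value are assumptions made for illustration only; they are not taken from the paper, and the ClipFormer transformation itself is not shown here.

```python
import random


def crossbar_mvm(voltages, conductances):
    """Ideal analog MVM: column current I_j = sum_i V_i * G_ij."""
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]


def add_write_noise(conductances, gamma, rng):
    """Hypothetical write-noise model (an assumption, not the paper's):
    each programmed conductance G becomes G * (1 + gamma * n), n ~ N(0, 1)."""
    return [
        [g * (1.0 + gamma * rng.gauss(0.0, 1.0)) for g in row]
        for row in conductances
    ]


rng = random.Random(0)            # fixed seed so runs are repeatable
G = [[1.0, 2.0], [3.0, 4.0]]      # 2x2 conductances, arbitrary units
V = [0.5, 1.0]                    # input voltages, arbitrary units

ideal = crossbar_mvm(V, G)        # [0.5*1 + 1*3, 0.5*2 + 1*4]
noisy = crossbar_mvm(V, add_write_noise(G, gamma=0.1, rng=rng))
error = [abs(a - b) for a, b in zip(ideal, noisy)]
print(ideal, noisy, error)
```

Under this assumed model the absolute perturbation of each cell scales with its conductance, which gives one intuition for why mapping values toward lower conductances could reduce the accumulated error in the column currents.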

Paper Structure

This paper contains 13 sections, 11 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: (a) Radar chart for a ViT model (DeiT-S) comparing MVMs implemented on RRAM-based IMC crossbars against digital SRAM IMC arrays of size 64$\times$64. (b) Non-ideal accuracy of the ViT model (DeiT-S) with and without ClipFormer. Note that the non-ideal accuracies or accuracy losses are shown across low and high write noise regimes characterized by $\gamma$. Refer to Section \ref{sec:framework} for hardware-related details. (c) Pictorial depiction of the overhead of VAT against the inference-only ClipFormer method for improving robustness of ViT models against IMC write noise. (Top) In VAT, we go through Step-(1) and Step-(2) iteratively until training convergence. Here, Step-(1) denotes NVM noise-integration and Step-(2) denotes an epoch of VAT. (Bottom) In ClipFormer, we go through Step-(1) and Step-(2) only once. Here, Step-(1) denotes NVM noise-integration and Step-(2) denotes inference with non-ideal parameters. Note that ClipFormer does not involve any training or fine-tuning.
  • Figure 2: Encoder architecture of a Vision transformer.
  • Figure 3: A 2$\times$2 memristive crossbar array.
  • Figure 4: Pictorial representation of the ViT-X framework for pre-trained ViT models. This framework evaluates the non-ideal accuracy of ViT models and also estimates the hardware area and energy expended by the attention layers. The hardware parameters input to the framework are listed in Table \ref{tab:crossbar-prop}. If the ClipFormer transformation (see Algorithm \ref{alg:trans}) is to be integrated with ViT-X, it is done after the second stage, before non-ideality integration, as shown.
  • Figure 5: Histograms showing the distributions of the $K$ and $V$ matrices in the first attention block of the pre-trained DeiT-S model (without including crossbar noise) before and after the ClipFormer transformations.
  • ...and 5 more figures