FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

TL;DR

FAR tackles the incompatibility between Transformer attention and in-memory computing (IMC) hardware by replacing all attention blocks with multi-head BiLSTM modules, trained through block-wise distillation while the pretrained backbone stays frozen. The approach preserves functional behavior and transfer performance, and uses structured pruning with a DeepHoyer-based criterion to adapt to resource constraints, yielding a compact, hardware-friendly inference path. Hardware analysis on ReRAM IMC indicates substantial reductions in memory traffic, end-to-end latency, and energy compared with attention-based baselines, with FAR approaching native IMC efficiency and achieving orders-of-magnitude gains relative to GPU execution. Overall, FAR demonstrates that end-to-end inference with pretrained transformers can be restructured around IMC-friendly sequential modules without retraining from scratch, enabling energy-efficient, scalable deployment on memory-centric accelerators.
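
As background for the pruning criterion mentioned above, DeepHoyer is built on the Hoyer-Square measure, the squared L1 norm divided by the squared L2 norm, which is small when only a few entries (or groups) are nonzero. The sketch below computes the element-wise and group forms in plain Python. How FAR groups weights, sets thresholds, or schedules the regularizer is not stated here, so the function names and the grouping in the example are illustrative assumptions.

```python
import math

def hoyer_square(values):
    """Hoyer-Square measure: (sum |v|)^2 / sum v^2.

    Equals 1 when a single entry is nonzero and len(values) when all
    entries share the same magnitude. Returns 0.0 for an all-zero input.
    """
    l1 = sum(abs(v) for v in values)
    l2_squared = sum(v * v for v in values)
    return (l1 * l1) / l2_squared if l2_squared > 0 else 0.0

def group_hoyer_square(groups):
    """Group form: apply the same measure to the L2 norm of each group.

    Each group is a hypothetical set of weights that would be pruned
    together, for example all weights tied to one LSTM hidden unit.
    """
    norms = [math.sqrt(sum(w * w for w in group)) for group in groups]
    return hoyer_square(norms)

# One active group out of two scores lower (sparser) than two active groups.
one_active = group_hoyer_square([[3.0, 4.0], [0.0, 0.0]])
two_active = group_hoyer_square([[3.0, 4.0], [3.0, 4.0]])
```

Minimizing the group form pushes whole groups toward zero, which is what makes it a natural fit for structured rather than element-wise pruning.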

Abstract

While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.
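
To make the replacement module concrete, the following is a minimal forward-pass sketch of a multi-head bidirectional LSTM in plain Python: project the tokens, split them by head, run a forward and a backward LSTM per head, concatenate, and project back to the original width. It is not the authors' implementation. The parameter layout, the `[i, f, g, o]` gate ordering, the per-direction hidden size, and the random initialization are all assumptions chosen for illustration.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def lstm_scan(tokens, W_ih, W_hh, bias):
    """Run one LSTM direction over the tokens; gate rows are stacked [i, f, g, o]."""
    H = len(W_hh[0])
    h, c, outputs = [0.0] * H, [0.0] * H, []
    for x in tokens:
        z = [a + r + b for a, r, b in zip(matvec(W_ih, x), matvec(W_hh, h), bias)]
        i = [sigmoid(v) for v in z[:H]]
        f = [sigmoid(v) for v in z[H:2 * H]]
        g = [math.tanh(v) for v in z[2 * H:3 * H]]
        o = [sigmoid(v) for v in z[3 * H:]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs.append(h)
    return outputs

def init_params(dim, num_heads, hidden, seed=0):
    """Hypothetical parameter layout: one (forward, backward) LSTM pair per head."""
    rng = random.Random(seed)
    head_dim = dim // num_heads

    def direction():
        return (rand_matrix(4 * hidden, head_dim, rng),
                rand_matrix(4 * hidden, hidden, rng),
                [0.0] * (4 * hidden))

    return {
        "W_in": rand_matrix(dim, dim, rng),
        "heads": [(direction(), direction()) for _ in range(num_heads)],
        "W_out": rand_matrix(dim, 2 * hidden * num_heads, rng),
    }

def multihead_bilstm(tokens, params):
    """Map a token sequence to a sequence of the same length and width."""
    projected = [matvec(params["W_in"], x) for x in tokens]
    num_heads = len(params["heads"])
    head_dim = len(projected[0]) // num_heads
    merged = [[] for _ in tokens]
    for n, (fwd, bwd) in enumerate(params["heads"]):
        sub = [p[n * head_dim:(n + 1) * head_dim] for p in projected]
        fwd_out = lstm_scan(sub, *fwd)
        bwd_out = lstm_scan(sub[::-1], *bwd)[::-1]  # re-align to token order
        for t in range(len(tokens)):
            merged[t].extend(fwd_out[t] + bwd_out[t])
    return [matvec(params["W_out"], v) for v in merged]

rng = random.Random(1)
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
params = init_params(dim=8, num_heads=2, hidden=4)
out = multihead_bilstm(tokens, params)  # 5 tokens, each of width 8
```

Each token costs a fixed number of matrix-vector products against stored weights, so the work grows linearly with sequence length and involves no activation-to-activation products of the kind attention requires.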

Paper Structure

This paper contains 17 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: IMC crossbar illustration
  • Figure 2: Block-wise replacement of attention. Each replaced module is supervised by a similarity loss, while a classification loss is applied at the output. During distillation, only the replacement blocks are updated.
  • Figure 3: Multi-head BiLSTM module used to replace attention. The input is first projected into $N$ subspaces and split by head. Each subspace is processed by a BiLSTM, and the outputs are concatenated and projected back to the original hidden size.
  • Figure 4: Structured pruning of LSTM hidden units. Removing one unit (shaded row) consistently prunes its input–hidden weights, hidden–hidden weights, and downstream projections. Coordinated pruning across all gate matrices preserves temporal alignment.
  • Figure 5: Pruning ratios across heads and directions. FAR learns to prune differently across layers and directions, revealing internal heterogeneity in representational redundancy.
  • ...and 1 more figure
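
The coordinated pruning described in the Figure 4 caption can be sketched as follows: removing one hidden unit means dropping its row in every gate block of the input–hidden and hidden–hidden matrices, its column in the hidden–hidden matrix, and its column in the downstream projection. This is a plain-Python illustration under assumed conventions (gate rows stacked as `[i, f, g, o]`, matrices stored as lists of rows). `prune_hidden_unit` is a hypothetical helper, not a function from the paper.

```python
def prune_hidden_unit(W_ih, W_hh, bias, W_proj, unit):
    """Remove hidden unit `unit` from one LSTM direction.

    W_ih:   4H x D input-hidden weights, gate rows stacked [i, f, g, o]
    W_hh:   4H x H hidden-hidden weights, same row stacking
    bias:   4H gate biases
    W_proj: P x H downstream projection that reads the hidden state
    """
    H = len(W_hh[0])
    # The unit owns one row in each of the four gate blocks.
    dropped = {gate * H + unit for gate in range(4)}
    keep = [r for r in range(4 * H) if r not in dropped]

    W_ih_pruned = [W_ih[r] for r in keep]
    # Drop the unit's rows and also the column that feeds its state back in.
    W_hh_pruned = [[v for j, v in enumerate(W_hh[r]) if j != unit] for r in keep]
    bias_pruned = [bias[r] for r in keep]
    # Downstream layers no longer receive this unit's output.
    W_proj_pruned = [[v for j, v in enumerate(row) if j != unit] for row in W_proj]
    return W_ih_pruned, W_hh_pruned, bias_pruned, W_proj_pruned

# Toy example with H = 3 hidden units, D = 2 inputs, P = 5 projection rows.
# Entries encode their own row index so the surviving rows are easy to inspect.
H, D, P = 3, 2, 5
W_ih = [[float(r)] * D for r in range(4 * H)]
W_hh = [[float(r * 10 + j) for j in range(H)] for r in range(4 * H)]
bias = [float(r) for r in range(4 * H)]
W_proj = [[float(j) for j in range(H)] for _ in range(P)]

W_ih_p, W_hh_p, bias_p, W_proj_p = prune_hidden_unit(W_ih, W_hh, bias, W_proj, unit=1)
```

Because the same index is removed from all four gate blocks, the remaining units keep their gate alignment and the pruned matrices stay dense, simply smaller.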