FAR: Function-preserving Attention Replacement for IMC-friendly Inference
Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang
TL;DR
FAR addresses the poor fit between Transformer attention and in-memory computing (IMC) by replacing every attention block with a multi-head BiLSTM module, trained through block-wise distillation while the pretrained backbone stays frozen. The approach preserves functional behavior and transfer performance, and it applies structured pruning with a DeepHoyer-based criterion to adapt to resource constraints, yielding a compact, hardware-friendly inference path. Hardware analysis on ReRAM IMC indicates substantial reductions in memory traffic, end-to-end latency, and energy compared with attention-based baselines, with FAR approaching native IMC efficiency and showing orders-of-magnitude gains over GPU execution. Overall, FAR demonstrates that inference for pretrained transformers can be restructured around IMC-friendly sequential modules without retraining from scratch, enabling energy-efficient, scalable deployment on memory-centric accelerators.
Abstract
While transformers dominate modern vision and language models, their attention mechanism remains poorly suited to in-memory computing (IMC) devices: it relies on intensive activation-to-activation multiplications and non-local memory access, which lead to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that replaces every attention block in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture trained via block-wise distillation, retaining functional equivalence while enabling linear-time computation and localized weight reuse. We further apply structured pruning to FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains accuracy comparable to the original attention-based models on ImageNet and multiple downstream tasks, with fewer parameters and lower latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.
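
As a rough illustration of the replacement step described above, the following PyTorch sketch pairs a frozen attention layer with a trainable multi-head BiLSTM and runs one distillation update on that single block. It is a minimal sketch under stated assumptions, not the authors' implementation: the class name `MultiHeadBiLSTM`, the helper `distill_step`, the per-head channel split, the output projection, the MSE objective, and the use of `nn.MultiheadAttention` as a stand-in for a DeiT attention block are all hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadBiLSTM(nn.Module):
    """Hypothetical token mixer: one small bidirectional LSTM per head."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0, "dim must split evenly across heads"
        head_dim = dim // num_heads
        assert head_dim % 2 == 0, "each head is split across two directions"
        self.num_heads = num_heads
        # Each direction emits head_dim // 2 features, so a head keeps its width.
        self.heads = nn.ModuleList(
            [
                nn.LSTM(head_dim, head_dim // 2, batch_first=True, bidirectional=True)
                for _ in range(num_heads)
            ]
        )
        self.proj = nn.Linear(dim, dim)  # assumed output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); split channels into per-head chunks.
        chunks = x.chunk(self.num_heads, dim=-1)
        outs = [lstm(chunk)[0] for lstm, chunk in zip(self.heads, chunks)]
        return self.proj(torch.cat(outs, dim=-1))


def distill_step(student, teacher, x, optimizer) -> float:
    """One block-wise update: match the frozen teacher's output on the same input."""
    with torch.no_grad():
        target, _ = teacher(x, x, x, need_weights=False)
    loss = F.mse_loss(student(x), target)  # assumed objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    dim, num_heads = 192, 3  # DeiT-Tiny-like sizes, chosen for illustration
    teacher = nn.MultiheadAttention(dim, num_heads, batch_first=True).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)  # the pretrained block stays frozen
    student = MultiHeadBiLSTM(dim, num_heads)
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    tokens = torch.randn(2, 197, dim)  # random stand-in for block inputs
    print(distill_step(student, teacher, tokens, optimizer))
```

Because only the student receives gradients, each block can in principle be fitted independently against the activations of its frozen counterpart. The recurrent weights are fixed at inference time, which is the property that makes this kind of module a natural candidate for weight-stationary IMC arrays.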
