ViT-LCA: A Neuromorphic Approach for Vision Transformers
Sanaz Mahmoodi Takaghaj
TL;DR
The paper tackles the challenge of deploying Vision Transformers on energy-constrained neuromorphic hardware by introducing ViT-LCA, a two-stage approach. A ViT encoder first extracts self-attention representations; these representations are then stored as dictionary atoms in a single-layer SNN equipped with a Locally Competitive Algorithm (LCA) encoder-decoder. The model represents an input as $S = \sum_i \boldsymbol{\phi}_i a_i + \varepsilon$, uses the Gramian $G = \boldsymbol{\phi}^T \boldsymbol{\phi}$ to drive sparse coding, and classifies with a Maximum Sum of Activations decoder, enabling in-memory, low-power computation on memristor crossbars. Empirically, ViT-LCA achieves competitive top-1 accuracy on CIFAR-10/100 and ImageNet-1K while using substantially less energy per inference than contemporary spiking transformer methods, demonstrating the viability of a uniform, neuromorphic-friendly approach to transformer-based vision. The work highlights practical avenues for deploying Vision Transformers on neuromorphic hardware and suggests extensions to other transformer architectures and datasets.
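The sparse-coding step summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the standard LCA dynamics with a soft-threshold activation, unit-norm dictionary atoms stored as the columns of `phi`, and a decoder that sums activations per class label and picks the largest sum. The function names and the parameters `lam`, `tau`, and `n_steps` are hypothetical choices made for illustration.

```python
import numpy as np

def soft_threshold(u, lam):
    # Assumed activation: shrink potentials toward zero by lam.
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lca_encode(s, phi, lam=0.1, tau=10.0, n_steps=200):
    """Sparse-code input s (shape (d,)) over dictionary phi (shape (d, n)).

    Euler integration of the assumed LCA dynamics
        tau * du/dt = b - u - (G - I) a,   a = soft_threshold(u, lam),
    where b = phi^T s is the feed-forward drive and G = phi^T phi is the Gramian.
    """
    b = phi.T @ s                          # feed-forward drive per atom
    gram = phi.T @ phi                     # Gramian G
    inhibition = gram - np.eye(gram.shape[0])  # lateral competition, no self-term
    u = np.zeros(phi.shape[1])             # membrane potentials
    for _ in range(n_steps):
        a = soft_threshold(u, lam)
        u += (b - u - inhibition @ a) / tau
    return soft_threshold(u, lam)

def max_sum_of_activations(a, atom_labels, n_classes):
    # Assumed decoder: sum the activations of all atoms that share a class
    # label, then return the class with the largest sum.
    scores = np.bincount(atom_labels, weights=a, minlength=n_classes)
    return int(np.argmax(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = rng.standard_normal((64, 20))
    phi /= np.linalg.norm(phi, axis=0)     # unit-norm atoms (columns)
    atom_labels = np.arange(20) % 4        # hypothetical class label per atom
    s = phi[:, 7]                          # input identical to a stored atom
    a = lca_encode(s, phi)
    print(max_sum_of_activations(a, atom_labels, n_classes=4))
```

In this toy setup the input equals one stored atom, so the code should concentrate on that atom and the decoder should return its label. In the paper's setting the atoms would instead be ViT representations of training images, which this sketch replaces with random vectors.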
Abstract
The recent success of Vision Transformers has generated significant interest in attention mechanisms and transformer architectures. Although existing methods have proposed spiking self-attention mechanisms compatible with spiking neural networks, they are often difficult to deploy effectively on current neuromorphic platforms. This paper introduces a novel model that combines vision transformers with the Locally Competitive Algorithm (LCA) to facilitate efficient neuromorphic deployment. Our experiments show that ViT-LCA achieves higher accuracy on the ImageNet-1K dataset while consuming significantly less energy than other spiking vision transformers. Furthermore, ViT-LCA's neuromorphic-friendly design allows for more direct mapping onto current neuromorphic architectures.
