ConvShareViT: Enhancing Vision Transformers with Convolutional Attention Mechanisms for Free-Space Optical Accelerators
Riad Ibadulla, Thomas M. Chen, Constantino Carlos Reyes-Aldasoro
TL;DR
ConvShareViT provides a convolution-only adaptation of Vision Transformers designed for the 4f free-space optical accelerator, enabling high-throughput, parallel inference by replacing linear QKV projections and MLPs with a shared depthwise convolution framework. The method demonstrates that receptive-field-aligned, valid-padding depthwise convolutions can learn attention within MHSA, preserving patch-based transformer semantics while leveraging optical parallelism. Systematic experiments on CIFAR-100 reveal that configurations with weight sharing and valid padding achieve attention learning and competitive accuracy (best around 63%), whereas same-padding variants underperform and resemble traditional CNNs. The work shows theoretical and practical pathways to accelerate transformer-based vision models in optical hardware, with substantial potential speedups over GPU baselines and a roadmap for further optimisations and real-world optical testing.
Abstract
This paper introduces ConvShareViT, a novel deep learning architecture that adapts Vision Transformers (ViTs) to the 4f free-space optical system. ConvShareViT replaces the linear layers in multi-head self-attention (MHSA) and the multilayer perceptrons (MLPs) with depthwise convolutional layers whose weights are shared across input channels. In developing ConvShareViT, the behaviour of convolutions within MHSA and their effectiveness in learning the attention mechanism were analysed systematically. Experimental results demonstrate that certain configurations, particularly those using valid-padded shared convolutions, can successfully learn attention, achieving attention scores comparable to those obtained with standard ViTs. However, other configurations, such as those using same-padded convolutions, show limitations in attention learning and operate like regular CNNs rather than transformer models. ConvShareViT architectures are specifically optimised for the 4f optical system so that they take advantage of the parallelism and high-resolution capabilities of optical systems. Results demonstrate that ConvShareViT can theoretically achieve up to 3.04 times faster inference than GPU-based systems. This potential acceleration makes ConvShareViT an attractive candidate for future optical deep learning applications and shows that a ViT can be implemented using only the convolution operation, provided the architecture is optimised to balance performance and complexity.
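To make the core operation concrete, the following is a minimal sketch of a valid-padded depthwise convolution whose single kernel is shared by every input channel. It is not the authors' implementation: the function names `conv2d_valid` and `shared_depthwise_conv`, the stride of 1, and the toy inputs are assumptions chosen for illustration. It also shows one reason valid padding may suit patch-based processing: when the kernel has the same size as its input, the output collapses to a single value equal to a dot product, which is what a linear projection computes.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(x: Matrix, kernel: Matrix) -> Matrix:
    """Valid-padded 2D cross-correlation: no padding, stride 1 (assumed)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shared_depthwise_conv(channels: List[Matrix], kernel: Matrix) -> List[Matrix]:
    """Depthwise convolution: each channel is filtered independently,
    but (hypothetically, as in this sketch) all channels reuse one kernel."""
    return [conv2d_valid(channel, kernel) for channel in channels]


# Toy example: two 2x2 channels and a 2x2 kernel.
# With valid padding each channel collapses to a 1x1 output,
# i.e. the dot product of the channel with the kernel.
channels = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
kernel = [[1, 0], [0, 1]]
print(shared_depthwise_conv(channels, kernel))  # [[[5]], [[0]]]
```

Same padding would instead keep the output the same spatial size as the input, so neighbouring positions mix as in an ordinary CNN feature map; the sketch above does not model that case.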
