ConvShareViT: Enhancing Vision Transformers with Convolutional Attention Mechanisms for Free-Space Optical Accelerators

Riad Ibadulla, Thomas M. Chen, Constantino Carlos Reyes-Aldasoro

TL;DR

ConvShareViT provides a convolution-only adaptation of Vision Transformers designed for the 4f free-space optical accelerator, enabling high-throughput, parallel inference by replacing linear QKV projections and MLPs with a shared depthwise convolution framework. The method demonstrates that receptive-field-aligned, valid-padding depthwise convolutions can learn attention within MHSA, preserving patch-based transformer semantics while leveraging optical parallelism. Systematic experiments on CIFAR-100 reveal that configurations with weight sharing and valid padding achieve attention learning and competitive accuracy (best around 63%), whereas same-padding variants underperform and resemble traditional CNNs. The work shows theoretical and practical pathways to accelerate transformer-based vision models in optical hardware, with substantial potential speedups over GPU baselines and a roadmap for further optimisations and real-world optical testing.

Abstract

This paper introduces ConvShareViT, a novel deep learning architecture that adapts Vision Transformers (ViTs) to the 4f free-space optical system. ConvShareViT replaces the linear layers in multi-head self-attention (MHSA) and Multilayer Perceptrons (MLPs) with depthwise convolutional layers whose weights are shared across input channels. Through the development of ConvShareViT, the behaviour of convolutions within MHSA and their effectiveness in learning the attention mechanism were analysed systematically. Experimental results demonstrate that certain configurations, particularly those using valid-padded shared convolutions, can successfully learn attention, achieving attention scores comparable to those obtained with standard ViTs. However, other configurations, such as those using same-padded convolutions, show limitations in attention learning and operate like regular CNNs rather than transformer models. ConvShareViT architectures are specifically optimised for the 4f optical system, taking advantage of the parallelism and high-resolution capabilities of optical systems. Results demonstrate that ConvShareViT can theoretically achieve up to 3.04 times faster inference than GPU-based systems. This potential acceleration makes ConvShareViT an attractive candidate for future optical deep learning applications and shows that a ViT can be implemented using only the convolution operation, provided the architecture is optimised to balance performance and complexity.
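The core substitution described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `shared_depthwise_valid`, the tensor sizes, and the use of cross-correlation (the usual deep learning meaning of "convolution") are assumptions made for illustration. It shows that when one bank of kernels, each the same size as the input channel, is applied with valid padding to every channel, the result matches a linear layer applied to each flattened channel.

```python
import numpy as np

def shared_depthwise_valid(x, kernels):
    """Apply one shared bank of kernels to every input channel.

    x:       (C, H, W) array, e.g. C tokens kept in 2D form.
    kernels: (D, H, W) array, shared by all C channels.
    A valid cross-correlation with a kernel as large as the input
    yields a single value, so the output has shape (C, D).
    """
    # Sum over the spatial axes for every (channel, kernel) pair.
    return np.einsum("chw,dhw->cd", x, kernels)

rng = np.random.default_rng(0)
C, H, W = 4, 3, 3        # illustrative sizes, not taken from the paper
D = H * W                # one kernel per output pixel, so the output can be reshaped to H x W

x = rng.standard_normal((C, H, W))
kernels = rng.standard_normal((D, H, W))

conv_out = shared_depthwise_valid(x, kernels)

# Equivalent linear layer: flatten each channel and multiply by the
# weight matrix whose rows are the flattened kernels.
linear_out = x.reshape(C, -1) @ kernels.reshape(D, -1).T

assert np.allclose(conv_out, linear_out)

# Each channel's D outputs can be reshaped back to the input resolution.
tokens_2d = conv_out.reshape(C, H, W)
```

Because the same kernels act on every channel, no summation across channels occurs, which is what lets each channel behave like an independent token passing through the same linear map.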

Paper Structure

This paper contains 14 sections, 5 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Schematic illustration of a 4f optical system executing a convolution operation. The system consists of an input plane (laser source), a convex lens, and a Fourier plane (containing a modulator or phase mask), followed by another convex lens and a camera, each positioned one focal length away from the lenses. As the light passes through the first lens, it undergoes a 2D Fourier transform at the Fourier plane, where it is multiplied by the kernel in the frequency domain. The light then travels through the second lens, which transforms it back to the spatial domain, and the camera captures the output.
  • Figure 2: Self-attention mechanism. Inputs are mapped to Query, Key, and Value vectors. Attention scores, calculated from Query-Key multiplications, capture dependencies between tokens. These scores are then used to weight the Value matrix, amplifying relevant information.
  • Figure 3: Comparison of the regular ViT (top of the figure) and ConvShareViT (bottom of the figure) pipelines. ViTs vectorise patches of the image and apply a linear layer to map them into higher-dimensional embeddings, while ConvShareViT keeps the patches in 2D format and uses transposed convolution to increase the dimensionality. ConvShareViT implements MHSA and MLPs with shared depthwise convolutional layers.
  • Figure 4: Implementation of the linear layer using convolution and tiled convolution for the 4f system. (a) A simple linear layer applied to one vector. The output vector is the vector-matrix multiplication of the input vector with the weight matrix. Each output node has its own set of weights. (b) Input nodes are in 2D matrix format, convolved with a kernel of equal size using valid padding. The output corresponds to one output node of the linear layer. (c) Kernel tiling is used to place all weights of the linear layer in one kernel block. The input is padded to the required resolution. The output contains all output nodes of the linear layer, but must be reshaped to remove the zeros in invalid regions.
  • Figure 5: Shared depthwise convolutional layer, which copies the weights across all input channels. (a) Regular convolutional layer, with groups=1. The number of 2D kernels is equal to the number of input channels $\times$ the number of output channels. (b) Depthwise convolution, where the number of groups is equal to the number of input channels. In this case, each output channel gets only one 2D kernel, meaning no channel summation happens. (c) In the shared depthwise convolutional layer, unlike the regular depthwise convolutional layer, the weights are shared across input channels, making it ideal for emulating the linear layer. If the kernels have the same resolution as the inputs, the valid convolution yields one pixel for each output channel, and these pixels can be reshaped into the initial resolution.
  • ...and 8 more figures
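
The 4f pipeline in the Figure 1 caption (Fourier transform, multiplication by the kernel at the Fourier plane, transform back) has a simple numerical analogue via the convolution theorem. The sketch below is an idealised software emulation, not a model of the optics: it ignores effects such as intensity-only detection at the camera, and the function names `fourier_plane_convolve` and `direct_convolve` are hypothetical. Zero-padding to the full output size is an assumption added so that the FFT-based circular convolution equals an ordinary linear convolution.

```python
import numpy as np

def fourier_plane_convolve(image, kernel):
    """Emulate lens 1 (FFT), Fourier-plane multiplication, lens 2 (inverse FFT)."""
    H, W = image.shape
    kh, kw = kernel.shape
    shape = (H + kh - 1, W + kw - 1)          # zero-pad to the full convolution size
    spectrum = np.fft.fft2(image, s=shape)    # field arriving at the Fourier plane
    mask = np.fft.fft2(kernel, s=shape)       # kernel expressed in the frequency domain
    return np.real(np.fft.ifft2(spectrum * mask))

def direct_convolve(image, kernel):
    """Reference full 2D convolution computed in the spatial domain."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(kh):
        for j in range(kw):
            # Each kernel tap adds a shifted, scaled copy of the image.
            out[i:i + H, j:j + W] += kernel[i, j] * image
    return out

rng = np.random.default_rng(1)
image = rng.standard_normal((8, 8))    # illustrative sizes
kernel = rng.standard_normal((3, 3))

optical_like = fourier_plane_convolve(image, kernel)
reference = direct_convolve(image, kernel)
assert np.allclose(optical_like, reference)

# The "valid" region keeps only positions where the kernel fully overlaps the image.
valid = optical_like[kernel.shape[0] - 1:image.shape[0],
                     kernel.shape[1] - 1:image.shape[1]]
```

In this emulation the cost of the multiplication step does not depend on the kernel size, which is the property the optical system exploits when large, tiled kernels are used.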