CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion

Aon Safdar, Mohamed Saadeldin

TL;DR

This paper tackles the rigidity of static channel mixing in Vision Transformers by introducing a dual-attention block called CAViT. It replaces the fixed MLP with a channel-wise self-attention stage, implemented via a dimension-swapping mechanism that treats channels as tokens, enabling dynamic inter-channel interactions conditioned on global context. Across five natural and medical imaging datasets, CAViT improves accuracy over the ViT baseline by up to +3.6% while reducing parameters and FLOPs by over 30%, and qualitative attention maps show sharper, semantically meaningful focus. The results suggest that unified, attention-based token mixing can boost representational power without increasing depth, pointing to a scalable path for efficient vision transformers.

Abstract

Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce 'CAViT', a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.

Paper Structure (13 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: CAViT overview. (Left) Standard tokenization. (Middle) Each block applies spatial MHSA on tokens $\mathbf{x}\!\in\!\mathbb{R}^{B\times(N{+}1)\times C}$, then swaps dimensions to $\mathbb{R}^{B\times(C{+}1)\times N}$ to treat channels as tokens and applies single-head channel self-attention (SHSA); the swap is then reversed. For spatial MHSA: $T\!=\!N{+}1$, $D\!=\!C$, $n{>}1$; for channel SHSA: $T\!=\!C{+}1$, $D\!=\!N$, $n\!=\!1$. (Right) Standard $Q,K,V$ projections and softmax over $T\times T$.
  • Figure 2: CAViT Transformer Block. We propose a dual-attention Transformer block that models both spatial and channel-wise interactions using two sequential self-attention stages. The first stage applies standard MHSA over spatial tokens and operates across the token axis. The second stage performs channel-wise attention by swapping spatial and channel dimensions and applying SHSA across channels.
  • Figure 3: Dimension Swapping Mechanism. Illustration of our dimension swapping logic used to enable channel-wise attention. The class token (CLS) is first separated from the input tensor $\mathbf{x} \in \mathbb{R}^{B \times (N{+}1) \times C}$, where $B$ is batch size, $N$ is the number of spatial tokens, and $C$ is the channel dimension. The remaining spatial tokens are transposed from shape $\mathbb{R}^{B \times N \times C}$ to $\mathbb{R}^{B \times C \times N}$, effectively treating channels as attention tokens. The CLS token is then concatenated back, resulting in a new tensor of shape $\mathbb{R}^{B \times (C{+}1) \times N}$. Single-head self-attention (SHSA) is applied across this transformed token dimension. The reverse operation restores the original layout, enabling standard downstream processing.
  • Figure 4: Top-1 Validation Accuracy across Training Epochs. Comparison of ViT$_{\text{tiny}}$ and CAViT$_{\text{tiny}}$ on five benchmarks. CAViT consistently shows faster convergence and better generalization on natural (CIFAR-10 [cifar10], CatsVsDogs [dogs-vs-cats]) and medical datasets (PneumoniaMNIST [yang_medmnist_2023], Malaria [malaria_dataset], BreastMNIST [yang_medmnist_2023]). Notably, CAViT yields consistent gains on datasets with limited resolution and inter-class variability.
  • Figure 5: Visualization of token attention. Attention maps for various samples across domains, including natural images, medical scans, and low-resolution categories. We use DINO-style visualization, averaging token attention maps across heads and tokens to highlight structural saliency. CAViT-Tiny captures more spatially coherent and semantically focused attention and better localizes foreground regions and anatomical features. (Best viewed in color and zoomed in.)
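
The dimension swap described in the Figure 1 and Figure 3 captions can be sketched in NumPy as below. This is a minimal illustration of the shape bookkeeping only, not the authors' implementation. One detail is an assumption: the captions do not say how the CLS token, which has length $C$, is made to fit a layout of width $N$, so the sketch uses two hypothetical projection matrices, `cls_to_n` and `n_to_cls`. All function names, the random weights, and the toy sizes are likewise illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def swap_to_channel_tokens(x, cls_to_n):
    """Map (B, N+1, C) to (B, C+1, N) so that channels act as tokens."""
    cls, spatial = x[:, :1, :], x[:, 1:, :]        # (B, 1, C), (B, N, C)
    channels = spatial.transpose(0, 2, 1)          # (B, C, N)
    # ASSUMPTION: a hypothetical linear map brings CLS from width C to width N.
    cls_n = cls @ cls_to_n                         # (B, 1, N)
    return np.concatenate([cls_n, channels], axis=1)

def swap_back(y, n_to_cls):
    """Reverse the swap: (B, C+1, N) back to (B, N+1, C)."""
    cls_n, channels = y[:, :1, :], y[:, 1:, :]     # (B, 1, N), (B, C, N)
    spatial = channels.transpose(0, 2, 1)          # (B, N, C)
    # ASSUMPTION: a second hypothetical map returns CLS to width C.
    cls = cls_n @ n_to_cls                         # (B, 1, C)
    return np.concatenate([cls, spatial], axis=1)

def single_head_attention(t, wq, wk, wv):
    """Single-head self-attention over T tokens of width D, input (B, T, D)."""
    q, k, v = t @ wq, t @ wk, t @ wv               # each (B, T, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(t.shape[-1])  # (B, T, T)
    return softmax(scores, axis=-1) @ v            # (B, T, D)

# Toy sizes: B images, N spatial tokens, C channels.
rng = np.random.default_rng(0)
B, N, C = 2, 16, 8
x = rng.standard_normal((B, N + 1, C))
cls_to_n = rng.standard_normal((C, N)) * 0.1
n_to_cls = rng.standard_normal((N, C)) * 0.1
wq, wk, wv = (rng.standard_normal((N, N)) * 0.1 for _ in range(3))

t = swap_to_channel_tokens(x, cls_to_n)            # T = C + 1, D = N
out = swap_back(single_head_attention(t, wq, wk, wv), n_to_cls)
print(t.shape, out.shape)
```

In the swapped layout the attention matrix is $(C{+}1)\times(C{+}1)$ rather than $(N{+}1)\times(N{+}1)$, and the $Q,K,V$ projections act on vectors of width $N$, matching the $T\!=\!C{+}1$, $D\!=\!N$, $n\!=\!1$ setting given in the Figure 1 caption.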