
A Novel Vision Transformer with Residual in Self-attention for Biomedical Image Classification

Arun K. Sharma, Nishchal K. Verma

TL;DR

The paper addresses the challenge of biomedical image classification under limited and imbalanced data by enhancing Vision Transformer (ViT) attention with a residual, best-head selection mechanism. It introduces a residual multi-head self-attention that ranks heads by the $L_1$-norm of their probabilistic attention and adds the best head's output as a residual to the final projection, integrated into a pre-trained ViT base. Evaluations on two biomedical datasets (blood cell images and brain MRI tumor detection) show improved accuracy, precision, recall, and F1 over CNN baselines and vanilla ViT, with better generalization as evidenced by smaller train–validation gaps. The approach maintains comparable computational complexity to standard ViT while delivering higher discriminative power, making it viable for small datasets and demanding biomedical tasks, and it invites future exploration on training from scratch on larger datasets.
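The best-head residual described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation. The summary does not say which $L_1$-norm is used; because every row of a softmax attention matrix sums to 1, an entrywise $L_1$-norm would be the same for every head, so this sketch assumes the induced matrix 1-norm (maximum column sum). The summary also does not say how the best head's output is matched to the width of the final projection, so a hypothetical extra matrix `w_res` is assumed. All function and variable names here are made up for the example.

```python
import math
import random

def matmul(a, b):
    # (n x m) times (m x p) for matrices stored as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    # Numerically stable row-wise softmax.
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def induced_l1_norm(m):
    # ASSUMPTION: matrix 1-norm = maximum absolute column sum.
    return max(sum(abs(v) for v in col) for col in zip(*m))

def residual_mhsa(x, heads, w_out, w_res):
    """x: n x d tokens; heads: list of (Wq, Wk, Wv), each d x d_h;
    w_out: (H * d_h) x d output projection;
    w_res: d_h x d projection for the residual (hypothetical, see lead-in)."""
    outputs, scores = [], []
    for wq, wk, wv in heads:
        q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
        scale = math.sqrt(len(wq[0]))
        logits = [[s / scale for s in row] for row in matmul(q, transpose(k))]
        attn = softmax_rows(logits)            # probabilistic attention, n x n
        outputs.append(matmul(attn, v))        # head output, n x d_h
        scores.append(induced_l1_norm(attn))   # score used to rank the heads
    best = max(range(len(heads)), key=lambda i: scores[i])
    # Standard multi-head path: concatenate head outputs, then project.
    concat = [sum((o[i] for o in outputs), []) for i in range(len(x))]
    projected = matmul(concat, w_out)
    # Residual path: add the best head's (projected) output.
    residual = matmul(outputs[best], w_res)
    y = [[p + r for p, r in zip(p_row, r_row)]
         for p_row, r_row in zip(projected, residual)]
    return y, best, scores

# Toy usage with random weights (sizes chosen arbitrarily).
random.seed(0)

def rand(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n, d, h, dh = 4, 6, 3, 2
x = rand(n, d)
heads = [(rand(d, dh), rand(d, dh), rand(d, dh)) for _ in range(h)]
w_out, w_res = rand(h * dh, d), rand(dh, d)
y, best, scores = residual_mhsa(x, heads, w_out, w_res)
print("best head:", best, "scores:", [round(s, 3) for s in scores])
```

Under the assumed norm, a head scores highly when many query tokens concentrate their attention on the same key token. Setting `w_res` to zeros recovers ordinary multi-head self-attention, which is consistent with the claim that the added cost over a standard ViT block is small.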

Abstract

Biomedical image classification requires capturing biological information from specific feature distributions. Most such applications are challenged by the limited availability of samples for diseased cases and by imbalanced datasets. This article presents a novel multi-head self-attention framework for the vision transformer (ViT) that makes it capable of capturing the specific image features needed for classification and analysis. The proposed method uses a residual connection to accumulate the best attention output in each multi-head attention block. The framework is evaluated on two small datasets: (i) a blood cell classification dataset and (ii) a brain tumor detection dataset of brain MRI images. The results show a significant improvement over the traditional ViT and other convolution-based state-of-the-art classification models.

Paper Structure

This paper contains 16 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: Architecture of Transformer in NLP
  • Figure 2: Architecture of Vision Transformer for computer vision
  • Figure 3: The proposed attention mechanism in Vision Transformer
  • Figure 4: Class distribution of the blood cell image dataset
  • Figure 5: Class distribution of Brain MRI Images for Brain Tumor Detection
  • ...and 8 more figures