Improved EATFormer: A Vision Transformer for Medical Image Classification
Yulong Shisu, Susano Mingwin, Yongshuai Wanwag, Zengqiang Chenso, Sunshin Huing
TL;DR
This work introduces EATFormer, a four-stage Vision Transformer backbone augmented with evolutionary-algorithm-inspired modules to improve medical image classification. Key innovations include Multi-Scale Region Aggregation (MSRA) for multi-scale inductive bias, Global and Local Interaction (GLI) for combined global-local features, and Modulated Deformable MSA (MD-MSA) for deformable attention. A Task-Related Head (TRH) accompanies the backbone, enabling flexible information fusion. Evaluations on Chest X-ray and Kvasir datasets demonstrate superior accuracy and competitive efficiency compared to CNNs and ViTs, indicating potential for faster, more reliable computer-aided diagnosis in clinical practice.
Abstract
Accurate analysis of medical images is vital for diagnosing and predicting medical conditions. Manual interpretation by radiologists and clinicians is prone to inconsistency and missed diagnoses, and computer-aided diagnosis systems can help achieve early, accurate, and efficient diagnoses. This paper presents an improved Evolutionary Algorithm-based Transformer (EATFormer), a Vision Transformer architecture for medical image classification. EATFormer combines the strengths of Convolutional Neural Networks and Vision Transformers, leveraging their ability to identify patterns in data and adapt to specific characteristics. The architecture incorporates novel components, including the Enhanced EA-based Transformer block with Feed-Forward Network, Global and Local Interaction (GLI), and Multi-Scale Region Aggregation (MSRA) modules. It also introduces the Modulated Deformable MSA (MD-MSA) module for dynamic modeling of irregular locations. The paper reviews the key features of the Vision Transformer (ViT) model: patch-based processing, incorporation of positional context, and the Multi-Head Attention mechanism. The MSRA module aggregates information from different receptive fields to provide an inductive bias, and the GLI module enhances the MSA-based global module by adding a local path that extracts discriminative local information. Experimental results on the Chest X-ray and Kvasir datasets demonstrate that the proposed EATFormer significantly improves prediction speed and accuracy compared to baseline models.
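To make the patch-based processing and positional context attributed to ViT above more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names `split_into_patches` and `add_position_index`, the toy 4 x 4 image, and the use of an integer index in place of a learned positional embedding are all assumptions made for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]  # a single-channel image as rows of pixel values


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch_size x patch_size
    patches and flatten each patch into a vector (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


def add_position_index(patches: List[List[float]]) -> List[Tuple[int, List[float]]]:
    """Pair each flattened patch with its sequence position.

    Assumption: the integer index is only a stand-in for the positional
    embedding that a real ViT adds to every patch token.
    """
    return list(enumerate(patches))


# Toy 4 x 4 "image" whose pixel values are 0..15 (hypothetical data).
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
tokens = add_position_index(patches)
```

In this sketch a 4 x 4 image with 2 x 2 patches yields four tokens of four values each. An actual ViT would project every flattened patch through a learned linear layer before feeding the token sequence to Multi-Head Attention.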
