Classification of Endoscopy and Video Capsule Images using CNN-Transformer Model

Aliza Subedi, Smriti Regmi, Nisha Regmi, Bhumi Bhusal, Ulas Bagci, Debesh Jha

TL;DR

The paper tackles automatic, multi-class GI endoscopy image classification despite dataset imbalance by proposing a hybrid architecture that fuses DenseNet201-based CNN features with a Swin Transformer branch to capture both local details and global context. The approach, trained and evaluated on GastroVision and Kvasir-Capsule using stratified 5-fold cross-validation, outperforms standalone CNN and Transformer baselines across metrics such as MCC, precision, recall, and F1. Key contributions include the three-component architecture, end-to-end fine-tuning, and saliency-map explanations that enhance model interpretability in a high-risk medical domain. The work demonstrates strong potential for aiding early GI anomaly detection and supports future clinical collaboration to refine and deploy AI-assisted diagnosis tools in endoscopic workflows.
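The stratified 5-fold protocol mentioned above can be illustrated with a small standard-library sketch. It is not the authors' code: the function name `stratified_kfold`, the round-robin assignment, the seed, and the toy class names and counts are all assumptions chosen for illustration. The point it shows is that each validation fold keeps roughly the same class proportions as the full dataset, which matters when some classes are rare.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose validation folds
    roughly preserve the class proportions of `labels`."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into k folds
    # so every fold receives a near-equal share of every class.
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the validation set; the rest form the training set.
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val


# Toy imbalanced label list (hypothetical class names and counts).
labels = ["normal"] * 50 + ["polyp"] * 10 + ["ulcer"] * 5
splits = list(stratified_kfold(labels, k=5))
for fold_id, (train, val) in enumerate(splits):
    print(fold_id, dict(Counter(labels[i] for i in val)))
```

In practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used instead; the sketch only makes the idea concrete.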

Abstract

Gastrointestinal cancer is a leading cause of cancer-related incidence and death, making it crucial to develop novel computer-aided diagnosis systems for early detection and enhanced treatment. Traditional approaches rely on the expertise of gastroenterologists to identify diseases; however, this process is subjective, and interpretation can vary even among expert clinicians. Considering recent advancements in classifying gastrointestinal anomalies and landmarks in endoscopic and video capsule endoscopy images, this study proposes a hybrid model that combines the advantages of Transformers and Convolutional Neural Networks (CNNs) to enhance classification performance. Our model utilizes DenseNet201 as a CNN branch to extract local features and integrates a Swin Transformer branch for global feature understanding, combining both to perform the classification task. For the GastroVision dataset, our proposed model demonstrates excellent performance with Precision, Recall, F1 score, Accuracy, and Matthews Correlation Coefficient (MCC) of 0.8320, 0.8386, 0.8324, 0.8386, and 0.8191, respectively, showcasing its robustness against class imbalance and surpassing other CNNs as well as the Swin Transformer model. Similarly, for the Kvasir-Capsule, a large video capsule endoscopy dataset, our model outperforms all others, achieving overall Precision, Recall, F1 score, Accuracy, and MCC of 0.7007, 0.7239, 0.6900, 0.7239, and 0.3871. Moreover, we generated saliency maps to explain our model's focus areas, demonstrating its reliable decision-making process. The results underscore the potential of our hybrid CNN-Transformer model in aiding the early and accurate detection of gastrointestinal (GI) anomalies.
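To make the reported metrics concrete, the following standard-library sketch computes support-weighted Precision, Recall, and F1 alongside Accuracy and the multi-class MCC. The averaging mode is an assumption: the text here does not state it, but the identical Recall and Accuracy values (0.8386 and 0.7239) are what support-weighted recall would produce, since weighted recall always equals accuracy. The function name `weighted_metrics` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision/recall/F1, accuracy, and multi-class MCC."""
    classes = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for c in classes:
        p_c = correct[c] / predicted[c] if predicted[c] else 0.0
        r_c = correct[c] / support[c] if support[c] else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        w = support[c] / n         # weight each class by its share of true samples
        precision += w * p_c
        recall += w * r_c
        f1 += w * f_c

    n_correct = sum(correct.values())
    accuracy = n_correct / n

    # Multi-class MCC computed from confusion-matrix marginals.
    cov_tp = n_correct * n - sum(support[c] * predicted[c] for c in classes)
    cov_tt = n * n - sum(support[c] ** 2 for c in classes)
    cov_pp = n * n - sum(predicted[c] ** 2 for c in classes)
    denom = math.sqrt(cov_tt * cov_pp)
    mcc = cov_tp / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}


# Toy imbalanced example (hypothetical labels): 8 samples of "a", 2 of "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 9 + ["b"] * 1
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

On this toy input the accuracy is noticeably higher than the MCC, which illustrates in general terms how the two can diverge on imbalanced data; it is not an analysis of the paper's own results.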

Paper Structure (19 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Schematic representation of the Hybrid CNN-Transformer model.
  • Figure 2: Example samples from GastroVision (Jha et al., 2023) and Kvasir-Capsule (Smedsrud et al., 2021). The first three columns show GI endoscopy images (GastroVision dataset); the last three columns show video capsule endoscopy images (Kvasir-Capsule).
  • Figure 3: Saliency map visualization for four representative classes of the GastroVision dataset. The figure shows the regions of the input image that contribute most to the model's decision.