Multi-Class Abnormality Classification Task in Video Capsule Endoscopy

Dev Rishi Verma, Vibhor Saxena, Dhruv Sharma, Arpan Gupta

TL;DR

The paper tackles multiclass abnormality classification in Video Capsule Endoscopy (VCE) by evaluating a progression of architectures from a CNN baseline to advanced transformer models (ViT, MViT, DaViT). DaViT emerges as the strongest performer on validation, achieving a high mean AUC and balanced accuracy, while test-time results reveal a gap and a 7th-place ranking in Capsule Vision 2024. The study demonstrates the potential of transformer-based approaches to model complex visual patterns in medical video data, suggesting that multi-scale and dual-attention designs can enhance diagnostic accuracy. The findings emphasize both the promise and the challenges of generalization to diverse clinical data, highlighting the need for more diverse datasets and robustness-focused improvements for real-world deployment.

Abstract

In this work for the Capsule Vision Challenge 2024, we addressed multiclass anomaly classification in Video Capsule Endoscopy (VCE)[1] with a variety of deep learning models, ranging from custom CNNs to advanced transformer architectures. The goal is to correctly classify diverse gastrointestinal disorders, which is critical for increasing diagnostic efficiency in clinical settings. We started with a baseline CNN model and improved performance with ResNet[2] for better feature extraction, followed by Vision Transformer (ViT)[3] to capture global dependencies. We further improved the results with Multiscale Vision Transformer (MViT)[4] for hierarchical feature extraction, while Dual Attention Vision Transformer (DaViT)[5] delivered the best results by combining spatial and channel attention. Our best balanced accuracy on the validation set[6] was 0.8592, with a Mean AUC of 0.9932. This progression improved model accuracy across a wide range of criteria, with DaViT surpassing the other models we evaluated. Additionally, our team, Capsule Commandos, achieved a 7th-place ranking with a test set[7] performance of Mean AUC 0.7314 and balanced accuracy 0.3235.
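The two metrics reported above can be illustrated with a minimal sketch. It assumes balanced accuracy is the mean of per-class recall and that Mean AUC is the unweighted average of one-vs-rest AUCs; the challenge's exact evaluation script is not shown in this text, and the function names and toy data below are hypothetical.

```python
from typing import List, Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall, so rare classes count as much as common ones."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in sample count, which is
    acceptable for a small illustration.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_ovr_auc(y_true: Sequence[int], probs: Sequence[Sequence[float]]) -> float:
    """Unweighted mean of one-vs-rest AUCs over all classes (assumed definition)."""
    n_classes = len(probs[0])
    aucs: List[float] = []
    for c in range(n_classes):
        labels = [1 if t == c else 0 for t in y_true]
        scores = [row[c] for row in probs]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / n_classes


# Hypothetical toy data: class 0 dominates, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 1, 2]
probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.5, 0.4, 0.1],  # true class 1, but class 0 scores highest
    [0.2, 0.1, 0.7],
]
y_pred = [max(range(len(row)), key=row.__getitem__) for row in probs]

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
mean_auc = mean_ovr_auc(y_true, probs)
```

On this toy data plain accuracy is 5/6 while balanced accuracy is 2/3, because the single missed minority class costs a full third of the score. The mean one-vs-rest AUC stays close to 1, since it measures ranking quality rather than the final argmax decision, which is one way a high Mean AUC can coexist with a much lower balanced accuracy.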

Paper Structure

This paper contains 12 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Block diagram of the developed DaViT pipeline. The DaViT model is the same as in the DaViT reference [5].
  • Figure 2: Confusion Matrix for the DaViT Model on the validation set.
  • Figure 3: ROC Curve for the DaViT Model on the validation set.