A Robust Pipeline for Classification and Detection of Bleeding Frames in Wireless Capsule Endoscopy using Swin Transformer and RT-DETR

Sasidhar Alavala, Anil Kumar Vadde, Aparnamala Kancheti, Subrahmanyam Gorthi

TL;DR

This paper tackles automatic classification of bleeding frames and detection of bleeding regions in wireless capsule endoscopy (WCE) videos. It proposes a two-stage pipeline that first classifies frames with SwinV2 and then localizes bleeding regions with RT-DETR, supported by image preprocessing: Lab colour space conversion, CLAHE, and Gaussian blur. Ablation-CAM is used to provide interpretability. On the validation set, the approach reaches a classification accuracy of 98.5% with preprocessing and an AP50 of 66.7%; on the test set, it reaches 87.0% accuracy and an F1 score of 89.0%. Together, the preprocessing, transformer-based classification, efficient multi-scale detection, and explainability yield fast, robust WCE bleeding analysis with potential clinical impact.
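The two-stage control flow can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `FrameResult`, the `classify` and `detect` callables, the box format, and the 0.5 decision threshold are all hypothetical names and values introduced here. It also assumes that the detector runs only on frames the classifier flags as bleeding, which the summary suggests but does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class FrameResult:
    """Per-frame output of the sketch pipeline."""
    is_bleeding: bool
    boxes: List[Box] = field(default_factory=list)


def run_pipeline(
    frames: Sequence[Any],
    classify: Callable[[Any], float],        # stand-in for the classifier (e.g. SwinV2)
    detect: Callable[[Any], Sequence[Box]],  # stand-in for the detector (e.g. RT-DETR)
    threshold: float = 0.5,                  # assumed decision threshold
) -> List[FrameResult]:
    """Classify every frame; run detection only on frames flagged as bleeding."""
    results: List[FrameResult] = []
    for frame in frames:
        bleeding_probability = classify(frame)
        if bleeding_probability >= threshold:
            # Stage 2: localize bleeding regions in the flagged frame.
            results.append(FrameResult(True, list(detect(frame))))
        else:
            # Non-bleeding frames skip the detector entirely.
            results.append(FrameResult(False))
    return results
```

Gating the detector on the classifier output keeps the more expensive localization step off frames that are unlikely to contain bleeding, at the cost of missing regions in any frame the classifier rejects.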

Abstract

In this paper, we present our approach to the Auto WCEBleedGen Challenge V2 2024. Our solution combines the Swin Transformer for the initial classification of bleeding frames and RT-DETR for the subsequent detection of bleeding regions in Wireless Capsule Endoscopy (WCE), enhanced by a series of image preprocessing steps. These steps include converting images to Lab colour space, applying Contrast Limited Adaptive Histogram Equalization (CLAHE) for better contrast, and using Gaussian blur to suppress artefacts. The Swin Transformer uses a tiered architecture with shifted windows to compute self-attention efficiently, restricting attention to local windows while still enabling cross-window interactions. RT-DETR features an efficient hybrid encoder for fast processing of multi-scale features and uncertainty-minimal query selection for improved accuracy. The class activation maps produced by Ablation-CAM provide plausible explanations for the model's decisions. On the validation set, this approach achieves a classification accuracy of 98.5% (the best among the state-of-the-art models compared), against 91.7% without any preprocessing, and an $\text{AP}_{50}$ of 66.7%, against 65.0% for the state-of-the-art YOLOv8. On the test set, this approach achieves a classification accuracy of 87.0% and an F1 score of 89.0%.

Paper Structure (8 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Proposed bleeding detection pipeline.
  • Figure 2: Block-level RT-DETR architecture [lv2023detrs].
  • Figure 3: Visual results of our approach on test-1 marked images (top row) and test-2 unmarked images (bottom row).