Reliable Mislabel Detection for Video Capsule Endoscopy Data
Julia Werner, Julius Oexle, Oliver Bause, Maxime Le Floch, Franz Brinkmann, Hannah Tolle, Jochen Hampe, Oliver Bringmann
TL;DR
This work tackles mislabeled data in Video Capsule Endoscopy (VCE) by introducing a mislabel-detection and cleaning pipeline. The pipeline uses a three-CNN/GMM workflow to estimate per-sample noise, corrects or filters suspected labels, and retrains CNNs for improved anomaly detection. The methodology is evaluated in two stages: controlled noise injection on Kvasir-Capsule, and real-world cleaning on the Galar dataset with clinical validation by gastroenterologists. The cleaned data yields marked improvements over uncleaned data and existing baselines. Key results include a final dev-set accuracy of $93.83\%$ and an F1 score of $71.58\%$ on Galar after cleaning, and a Precision@100 of $78\%$ in clinician validation, which underscores the practical potential of cleaning datasets prior to training. The work offers a concrete, verifiable path toward more reliable medical image classification and, potentially, on-device anomaly detection in VCE systems.
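To make the Precision@100 figure concrete: under the usual definition of Precision@k, a value of $78\%$ at $k = 100$ would mean that 78 of the 100 samples ranked most suspicious were confirmed as mislabeled on review. The following is a minimal sketch of that metric, not the authors' evaluation code. The function name `precision_at_k`, the `scores` list (a hypothetical per-sample suspicion score) and the `confirmed` list (hypothetical reviewer verdicts) are assumptions made for illustration; the summary above does not specify how samples are ranked.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], confirmed: Sequence[bool], k: int) -> float:
    """Fraction of the k highest-scored samples that reviewers confirmed as mislabeled."""
    if len(scores) != len(confirmed):
        raise ValueError("scores and confirmed must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of samples")
    # Rank sample indices from most to least suspicious (hypothetical score).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    return sum(1 for i in top_k if confirmed[i]) / k


# Toy data (invented for illustration only, not taken from the paper).
scores = [0.91, 0.15, 0.78, 0.66, 0.05, 0.83]
confirmed = [True, False, True, False, False, True]

p_at_3 = precision_at_k(scores, confirmed, 3)
p_at_4 = precision_at_k(scores, confirmed, 4)
```

In this toy example the three highest-scored samples are all confirmed, while the fourth is not, so the value drops once `k` grows past the confirmed cases.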
Abstract
The classification performance of deep neural networks relies strongly on access to large, accurately annotated datasets. In medical imaging, however, obtaining such datasets is particularly challenging because annotations must be provided by specialized physicians, which severely limits the pool of annotators. Furthermore, class boundaries are often ambiguous or difficult to define, which further complicates machine learning-based classification. In this paper, we address this problem by introducing a framework for mislabel detection in medical datasets. The framework is validated on the two largest publicly available datasets for Video Capsule Endoscopy, an important imaging procedure for examining the gastrointestinal tract based on a video stream of low-resolution images. In addition, potentially mislabeled samples identified by our pipeline were reviewed and re-annotated by three experienced gastroenterologists. Our results show that the proposed framework successfully detects incorrectly labeled data and, after cleaning the datasets, yields improved anomaly detection performance compared to current baselines.
