COLON: The largest COlonoscopy LONg sequence public database

Lina Ruiz, Franklin Sierra-Jerez, Jair Ruiz, Fabio Martinez

TL;DR

Colorectal cancer (CRC) screening faces polyp detection challenges in long colonoscopy procedures, where polyps occupy only a small fraction of frames. The authors introduce COLON, the largest long-sequence colonoscopy dataset, featuring about 30k polyp-labeled frames and 400k background frames from 30 colonoscopies, plus 10 background-only sequences, to enable polyp segmentation and localization in realistic, background-rich videos. They define long-sequence segmentation and localization tasks (with frame-level scoring $S_t = \alpha\,\mu(S_{wp}) + (1-\alpha)\,\mu(S_{ob})$ and IoU-based metrics) and provide baseline evaluations of three state-of-the-art methods, revealing substantial gaps between performance on cropped data and performance on realistic long sequences. An online benchmarking platform accompanies COLON to foster community-driven development of robust, clinically applicable polyp detection and segmentation in real colonoscopy workflows.
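As a rough illustration of the scoring expression above, the sketch below computes a weighted combination of two mean scores, together with a plain IoU for binary masks. This summary does not define $S_{wp}$ and $S_{ob}$, so the code assumes (hypothetically) that $S_{wp}$ holds per-frame scores for frames with polyps, that $S_{ob}$ holds per-frame scores for background-only frames, and that $\mu$ is the arithmetic mean. The function names, the default `alpha`, and the handling of empty masks are illustrative choices, not the authors' implementation.

```python
from statistics import fmean
from typing import Sequence


def iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection over union of two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumption: two empty masks agree perfectly, so score them as 1.0.
    return 1.0 if union == 0 else intersection / union


def sequence_score(
    s_wp: Sequence[float],
    s_ob: Sequence[float],
    alpha: float = 0.5,  # hypothetical default; the source gives no value
) -> float:
    """S_t = alpha * mean(S_wp) + (1 - alpha) * mean(S_ob).

    Assumption: s_wp are per-frame scores on frames with polyps and
    s_ob are per-frame scores on background-only frames.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not s_wp or not s_ob:
        raise ValueError("both score lists must be non-empty")
    return alpha * fmean(s_wp) + (1.0 - alpha) * fmean(s_ob)


if __name__ == "__main__":
    # Toy values, not taken from the paper.
    polyp_scores = [iou([1, 1, 0, 0], [1, 0, 1, 0]), iou([1, 1, 0, 0], [1, 1, 0, 0])]
    background_scores = [1.0, 1.0, 1.0, 0.0]  # 1.0 = frame correctly left empty
    print(sequence_score(polyp_scores, background_scores, alpha=0.5))
```

Averaging the two groups separately before mixing them keeps the few polyp frames from being swamped by the much larger number of background frames, which is one plausible reason for a weight such as $\alpha$.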

Abstract

Colorectal cancer is the third most aggressive cancer worldwide. Polyps, as the main biomarker of the disease, are detected, localized, and characterized through colonoscopy procedures. Nonetheless, up to 25% of polyps are missed during the examination because of challenging conditions (camera movements, lighting changes) and the close similarity between polyps and intestinal folds. Besides, there is marked subjectivity and expert dependency in observing and detecting abnormal regions along the intestinal tract. Currently, publicly available polyp datasets have allowed significant advances in computational strategies dedicated to characterizing non-parametric polyp shapes. These computational strategies have achieved remarkable scores of up to 90% in segmentation tasks. Nonetheless, these strategies operate on cropped, expert-selected frames that always contain polyps. Consequently, these computational approaches are far from clinical scenarios and real applications, where colonoscopy recordings consist mostly of intestinal background with high textural variability. In fact, polyps typically represent less than 1% of the total observations in a complete colonoscopy record. This work introduces COLON: the largest COlonoscopy LONg sequence dataset, with around 30 thousand polyp-labeled frames and 400 thousand background frames. The dataset was collected from a total of 30 complete colonoscopies with polyps at different stages, variations in preparation procedures, and, in some cases, visible surgical instrumentation. Additionally, 10 control colonoscopy videos containing only intestinal background were included to support robust differentiation between polyp and background frames. The COLON dataset is open to the scientific community and provides new scenarios for developing computational tools dedicated to polyp detection and segmentation over long sequences, closer to real colonoscopy conditions.

Paper Structure (7 sections, 1 equation, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Comparison between public datasets available since 2015 (CVC-ClinicDB [bernal2015wm], ETIS-Larib [silva2014toward], CVC-300 [bernal2012towards], CVC-ClinicHD [sanchez2019computer, bernal2019gtcreator], Kvasir [Pogorelov:2017], ASU-Mayo [tajbakhsh2015automated], CVC-Video [angermann2017towards, bernal2018polyp]) and our proposed 2023 dataset associated with the COLON challenge.
  • Figure 2: Polyp description according to size, morphology (sessile or pedunculated), NICE classification, and biopsy result. The bottom panel shows the demographic information.
  • Figure 3: Frames extracted from the captured colonoscopy sequences. The first two rows contain polyps with their respective markings (green contours). The bottom row shows typical intestinal regions prone to being misidentified as polyps because of their polyp-like appearance (blue arrows).