REAL-Colon: A dataset for developing real-world AI applications in colonoscopy

Carlo Biffi, Giulio Antonelli, Sebastian Bernhofer, Cesare Hassan, Daizen Hirata, Mineo Iwatate, Andreas Maieron, Pietro Salvagnini, Andrea Cherubini

TL;DR

This paper introduces the REAL-Colon (Real-world multi-center Endoscopy Annotated video Library) dataset, a compilation of 2.7M native video frames from sixty full-resolution, real-world colonoscopy recordings across multiple centers. It is a unique resource for researchers and developers aiming to advance AI research in colonoscopy.

Abstract

Detection and diagnosis of colon polyps are key to preventing colorectal cancer. Recent evidence suggests that AI-based computer-aided detection (CADe) and computer-aided diagnosis (CADx) systems can enhance endoscopists' performance and boost colonoscopy effectiveness. However, most available public datasets primarily consist of still images or video clips, often at a down-sampled resolution, and do not accurately represent real-world colonoscopy procedures. We introduce the REAL-Colon (Real-world multi-center Endoscopy Annotated video Library) dataset: a compilation of 2.7M native video frames from sixty full-resolution, real-world colonoscopy recordings across multiple centers. The dataset contains 350k bounding-box annotations, each created under the supervision of expert gastroenterologists. Comprehensive patient clinical data, colonoscopy acquisition information, and polyp histopathological information are also included for each video. With its unprecedented size, quality, and heterogeneity, the REAL-Colon dataset is a unique resource for researchers and developers aiming to advance AI research in colonoscopy. Its openness and transparency facilitate rigorous and reproducible research, fostering the development and benchmarking of more accurate and reliable colonoscopy-related algorithms and models.

Paper Structure

This paper contains 17 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Flowchart outlining the two-phase selection process for creating the REAL-Colon dataset from 368 video recordings across four distinct cohorts. Phase 1 applies a penalty scoring system based on video and histological criteria. In Phase 2, the videos are ranked and 15 videos per cohort are manually selected to ensure diversity and representation while maintaining the cohort average lesion count.
  • Figure 2: Clinical Data Distribution. This figure presents histograms depicting the distribution of sex, age, polyp count per procedure, BBPS scores, endoscope brand, and procedure duration within the REAL-Colon dataset.
  • Figure 3: Polyp Characteristics Distribution. The histograms in this figure highlight the distribution of the anatomical location, size (in millimeters), and histology of the polyps included in the REAL-Colon dataset.
  • Figure 4: Left: Histogram displaying the number of bounding boxes per frame. Right: Distribution of the number of bounding boxes associated with each polyp.
  • Figure 5: Left: Histogram displaying the number of tracklets per polyp, using a 1-second threshold to identify separate tracklets. The x-axis represents the number of tracklets associated with each polyp, while the y-axis shows the count of polyps with that number of tracklets. Right: Plot illustrating the decrease in the number of tracklets as a function of the disappearance threshold. Here, the x-axis signifies the disappearance threshold in seconds, which determines when a new tracklet is created once a polyp disappears for longer than the threshold duration. The y-axis reports the resulting number of tracklets.
  • ...and 2 more figures
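
The tracklet definition in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each polyp's annotations are available as a list of frame indices, that the frame rate is known (the `fps` value below is a hypothetical example), and that a new tracklet starts whenever the polyp is absent for longer than the disappearance threshold. The function names and the toy data are invented for illustration.

```python
from typing import Dict, List


def count_tracklets(frames: List[int], fps: float, threshold_s: float = 1.0) -> int:
    """Count tracklets for one polyp.

    frames: frame indices in which the polyp has a bounding box (assumed input format).
    fps: video frame rate (assumed known).
    threshold_s: disappearance threshold in seconds.
    """
    if not frames:
        return 0
    ordered = sorted(set(frames))
    tracklets = 1
    for prev, curr in zip(ordered, ordered[1:]):
        # Number of frames in which the polyp is absent between two annotations.
        missing = curr - prev - 1
        # A new tracklet starts only if the absence lasts longer than the threshold.
        if missing / fps > threshold_s:
            tracklets += 1
    return tracklets


def total_tracklets(polyps: Dict[str, List[int]], fps: float, threshold_s: float) -> int:
    """Sum tracklet counts over all polyps for a given threshold."""
    return sum(count_tracklets(f, fps, threshold_s) for f in polyps.values())


# Toy data (hypothetical): two polyps annotated in a 30 fps video.
toy_polyps = {
    "polyp_a": [0, 1, 2, 40, 41, 200],
    "polyp_b": [10, 11, 12, 13],
}
toy_fps = 30.0

# Sweep the threshold, as in the right-hand plot of Figure 5.
sweep = {t: total_tracklets(toy_polyps, toy_fps, t) for t in (1.0, 2.0, 10.0)}
```

Because raising the threshold can only merge neighbouring tracklets, the total count in `sweep` never increases as the threshold grows, which matches the decreasing trend the caption describes.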