Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning

Ge-Peng Ji, Jingyi Liu, Deng-Ping Fan, Nick Barnes

TL;DR

Colon-X introduces ColonVQA as a million-scale multimodal colonoscopy dataset and analyzes current multimodal understanding through ColonEval and ColonPert to reveal robustness gaps. It then advances clinical reasoning with ColonReason and ColonR1, demonstrating substantial gains in reasoning-enabled performance under data-scarce conditions. The work highlights the importance of data quality, diverse task coverage, and reasoning-centric supervision for pushing laboratory insights toward real-world clinical utility. Collectively, it sets a data-centered, open foundation for next-generation intelligent colonoscopy with broad implications for medical multimodal research.

Abstract

In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy, namely the evolution from multimodal understanding to clinical reasoning: (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models (MLLMs) and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.

Paper Structure

This paper contains 32 sections, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Research roadmap of the Colon-X project. Building upon the most comprehensive multimodal colonoscopy database (ColonVQA, as detailed in §\ref{sec:colonvqa}), we propel a pivotal transition in intelligent colonoscopy, evolving from multimodal understanding (ColonEval in §\ref{sec:coloneval} & ColonPert in §\ref{sec:colonpert}) to clinical reasoning (ColonReason in §\ref{sec:colonreason} & ColonR1 in §\ref{sec:colonr1}). These efforts collectively illuminate the path to next-generation advances in clinical colonoscopy and broader medical applications.
  • Figure 2: Gallery of representative VQA samples from our ColonVQA. All 18 multimodal tasks are organized into a five-level taxonomy, reflecting the typical workflows in clinical colonoscopy. The statistics of each task category are summarized in Table \ref{tab2_b}.
  • Figure 3: Illustration of four human-induced perturbations.
  • Figure 4: Data curation pipeline for ColonReason (§\ref{sec:colonreason}). Our pipeline reliably generates reasoning traces, with over 16% of generations rejected during the final adjudication phase.
  • Figure 5: Design of ColonR1 (§\ref{sec:colonr1}). We extend the native GRPO \cite{guo2024deepseekcoder} with a task-adaptive reward scheme tailored to various colonoscopy tasks. Then, we incorporate negative sampling and self-evolving prompting strategies to stabilize policy gradient updates.
  • ...and 6 more figures
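
The Figure 5 caption describes ColonR1 as GRPO extended with a task-adaptive reward. As a rough illustration only, the sketch below shows the general shape of that idea: a reward that switches on task type, followed by the group-relative advantage normalization that GRPO is known for. The task names, the exact-match and token-F1 reward choices, the population standard deviation, and every function name here are assumptions made for illustration. They are not taken from the paper or its code release, and the negative sampling and self-evolving prompting strategies are not modeled.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical task names; the paper's actual task taxonomy is not reproduced here.
CLOSED_FORM_TASKS = {"classification", "yes_no"}


def token_f1(prediction: str, target: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred = prediction.lower().split()
    gold = target.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def task_reward(task: str, prediction: str, target: str) -> float:
    """Assumed task-adaptive reward: exact match for closed-form tasks, F1 otherwise."""
    if task in CLOSED_FORM_TASKS:
        return 1.0 if prediction.strip().lower() == target.strip().lower() else 0.0
    return token_f1(prediction, target)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: each reward is normalized by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of sampled answers for one hypothetical classification question.
group = ["polyp", "adenoma", "polyp", "ulcer"]
rewards = [task_reward("classification", answer, "polyp") for answer in group]
advantages = group_relative_advantages(rewards)
```

When every sampled answer in a group receives the same reward, the normalized advantages are all zero and the group contributes no learning signal. This is a general property of group-relative normalization, not a claim about how ColonR1 handles such groups.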