Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning
Ge-Peng Ji, Jingyi Liu, Deng-Ping Fan, Nick Barnes
TL;DR
Colon-X introduces ColonVQA as a million-scale multimodal colonoscopy dataset and analyzes current multimodal understanding through ColonEval and ColonPert to reveal robustness gaps. It then advances clinical reasoning with ColonReason and ColonR1, demonstrating substantial gains in reasoning-enabled performance under data-scarce conditions. The work highlights the importance of data quality, diverse task coverage, and reasoning-centric supervision for pushing laboratory insights toward real-world clinical utility. Collectively, it sets a data-centered, open foundation for next-generation intelligent colonoscopy with broad implications for medical multimodal research.
Abstract
In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy: the evolution from multimodal understanding to clinical reasoning. (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models (MLLMs) and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.
