Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Abdullah Hamdi, Changchun Yang, Xin Gao

Abstract

Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We use Colon-Bench to rigorously evaluate state-of-the-art MLLMs on lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The results show that MLLMs achieve surprisingly high localization performance in this medical domain compared to SAM-3. Finally, we analyze common VQA errors to introduce a novel "colon-skill" prompting strategy, improving zero-shot performance by up to 9.7% across most MLLMs. The dataset and code are available at https://abdullahamdi.com/colon-bench.
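To make the workflow concrete, the following is a minimal Python sketch of the multi-stage agentic annotation loop described above. All class, method, and argument names (Window, propose_windows, confirm, and so on) are hypothetical illustrations, not the authors' actual implementation.

```python
# Hypothetical sketch of the multi-stage agentic annotation workflow;
# every interface here is illustrative, not the authors' actual API.
from dataclasses import dataclass, field


@dataclass
class Window:
    """A candidate lesion window proposed on a full-procedure video."""
    start_frame: int
    end_frame: int
    label: str                                  # one of the 14 lesion categories
    boxes: list = field(default_factory=list)   # per-frame bounding boxes
    masks: list = field(default_factory=list)   # per-frame segmentation masks


def annotate_video(video, vlm, tracker, reviewer):
    """Successive filters reduce false positives while adding dense labels."""
    # 1. Temporal proposals: a VLM flags windows that may contain a lesion.
    proposals = vlm.propose_windows(video)

    # 2. Spatial annotation: a tracker turns each proposal into dense
    #    per-frame bounding boxes across the proposed window.
    for w in proposals:
        w.boxes = tracker.track(video, w.start_frame, w.end_frame)

    # 3. AI-driven visual confirmation: the VLM re-inspects the tracked
    #    regions and rejects windows whose content contradicts the label.
    confirmed = [w for w in proposals if vlm.confirm(video, w)]

    # 4. Human-in-the-loop review: a clinician accepts or rejects the
    #    small set of surviving windows, which is cheap to verify.
    return [w for w in confirmed if reviewer.accept(video, w)]
```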

Paper Structure

This paper contains 18 sections, 20 figures, and 18 tables.

Figures (20)

  • Figure 1: Colon-Bench Annotation Pipeline. Overview of the multi-stage agentic workflow used to build the dataset, from VLM proposal generation through verification, tracking with spatial annotations, AI confirmation, and clinician review. The figure highlights how successive filters reduce false positives while adding dense spatial and textual labels for the retained windows, leaving only a small set of annotations for a physician to verify at the end. The accepted annotations are then used to build the comprehensive Colon-Bench for MLLMs, covering multiple video tasks: Visual Question Answering (VQA), binary lesion classification, and Open-Vocabulary Video Object Segmentation (OV-VOS).
  • Figure 2: Lesion Category Distribution. Long-tailed lesion category distribution in Colon-Bench, highlighting the diversity of lesions.
  • Figure 3: Qualitative Comparisons. The first row shows detection (Sessile Polyp); the remaining rows show segmentation (Sessile Polyp & Erythematous Region). Columns show the input, ground truth, and model outputs. SAM-3 underperforms compared to modern MLLMs paired with a prompted tracker such as EdgeTAM [zhou2025edgetam].
  • Figure 4: Multi-Modal Large Language Models for Lesion Segmentation. Video object segmentation results on the 272-video benchmark, where MLLMs [gpt4, bai2025qwen3, deitke2025molmo, ge2023planting, glm4_2024, gemini2024, gemini15] first localize the target lesion with 3 bounding boxes and an EdgeTAM tracker [zhou2025edgetam] propagates masks (see the sketch after this list).
  • Figure 5: Colon-Skill on Prompted VQA. Skill-augmented prompting on the prompted VQA split, where a distilled error-based skill context yields consistent accuracy improvements for higher-capacity models (a prompting sketch follows this list).
  • ...and 15 more figures
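The OV-VOS protocol from the Figure 4 caption, in which an MLLM localizes the target lesion with a few bounding boxes and a prompted tracker propagates masks, can be sketched as below. The mllm and tracker interfaces are assumptions for illustration; in particular, init and propagate are not EdgeTAM's real API.

```python
# Illustrative sketch of the OV-VOS protocol in Figure 4; the method and
# argument names are assumptions, not EdgeTAM's or any MLLM's real API.
def segment_lesion(video_frames, lesion_name, mllm, tracker, n_boxes=3):
    """MLLM localizes the lesion; a box-prompted tracker propagates masks."""
    # 1. Ask the MLLM for a few bounding boxes, phrased as an
    #    open-vocabulary query over the lesion category name.
    prompt = f"Return bounding boxes around the {lesion_name} in these frames."
    box_prompts = mllm.localize(video_frames, prompt, num_boxes=n_boxes)

    # 2. Seed the tracker with the boxes and propagate masks through
    #    the rest of the video, frame by frame.
    tracker.init(video_frames, box_prompts)
    return [tracker.propagate(frame) for frame in video_frames]
```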
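Similarly, the "colon-skill" prompting strategy from Figure 5 amounts to prepending a distilled, error-derived skill context to each zero-shot VQA query. The skill text and the mllm.answer call below are invented placeholders for illustration, not the authors' distilled context or any model's real API.

```python
# Hypothetical illustration of "colon-skill" prompting: a context distilled
# from common MLLM failure modes is prepended to every VQA query. The skill
# text below is an invented placeholder, not the authors' distilled context.
COLON_SKILL = (
    "Before answering, note common pitfalls in colonoscopy VQA: do not "
    "confuse specular highlights with lesions, check lesion morphology "
    "(sessile vs. pedunculated), and report bleeding only when fresh "
    "blood is visible."
)

def ask_with_skill(mllm, video_frames, question):
    """Zero-shot VQA with the skill context prepended to the question."""
    return mllm.answer(video_frames, prompt=f"{COLON_SKILL}\n\n{question}")
```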