CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection

Guankun Wang, Han Xiao, Huxin Gao, Renrui Zhang, Long Bai, Xiaoxiao Yang, Zhen Li, Hongsheng Li, Hongliang Ren

TL;DR

CoPESD introduces the first multimodal, multi-level surgical motion dataset for Endoscopic Submucosal Dissection (ESD), built to train large vision-language models (LVLMs) as a surgical co-pilot. It defines a five-level motion granularity and collects 17,679 images with 32,699 bounding boxes and 88,395 labeled motions from over 35 hours of robot-assisted and conventional ESD videos, enabling fine-grained motion instruction-following. Fine-tuning SPHINX-X and LLaVA-1.5 with LLaMA-2 backbones on CoPESD yields strong GPT-based response quality scores (approximately 84–86) and robust grounding (mean IoU around 60–70), and higher image resolution and larger models improve performance further. The dataset is publicly available and paves the way for LVLM-driven ESD automation and safer, more precise robotic endoscopy; future work aims to incorporate temporal information for dynamic prediction.
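The grounding figure above is a mean Intersection-over-Union (IoU) between predicted and annotated bounding boxes. As a minimal sketch of how such a score is typically computed, the snippet below assumes boxes in (x_min, y_min, x_max, y_max) format and reports the mean on a 0–100 scale. The function names and box format are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou_percent(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Mean IoU over paired predictions and annotations, scaled to 0-100."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    total = sum(iou(p, t) for p, t in zip(preds, targets))
    return 100.0 * total / len(preds)
```

Under this convention, a predicted instrument box that exactly matches its annotation scores 100 and a box that misses entirely scores 0, so a mean around 60–70 indicates substantial but imperfect overlap on average.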

Abstract

Endoscopic submucosal dissection (ESD) enables rapid resection of large lesions, minimizing recurrence rates and improving long-term overall survival. Despite these advantages, ESD is technically challenging and carries a high risk of complications, necessitating skilled surgeons and precise instruments. Recent advancements in Large Visual-Language Models (LVLMs) offer promising decision support and predictive planning capabilities for robotic systems, which can augment the accuracy of ESD and reduce procedural risks. However, existing datasets for multi-level fine-grained ESD surgical motion understanding are scarce and lack detailed annotations. In this paper, we design a hierarchical decomposition of ESD motion granularity and introduce a multi-level surgical motion dataset (CoPESD) for training LVLMs as the robotic Co-Pilot of Endoscopic Submucosal Dissection. CoPESD includes 17,679 images with 32,699 bounding boxes and 88,395 multi-level motions, from over 35 hours of ESD videos for both robot-assisted and conventional surgeries. CoPESD enables granular analysis of ESD motions, focusing on the complex task of submucosal dissection. Extensive experiments on LVLMs demonstrate the effectiveness of CoPESD in training them to predict the subsequent surgical robotic motions. As the first multimodal ESD motion dataset, CoPESD supports advanced research in ESD instruction-following and surgical automation. The dataset is available at https://github.com/gkw0010/CoPESD.

Paper Structure

This paper contains 32 sections, 4 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Illustration of Endoscopic Submucosal Dissection with different system instruments. (a) ESD surgery in the gastric body. (b) Endoscopic view of DREAMS-assisted ESD instruments. (c) Endoscopic view of conventional ESD instruments. (d) Multi-level surgical motion instruction demonstrations in CoPESD.
  • Figure 2: Overview of different levels of surgical motion granularity for Endoscopic Submucosal Dissection.
  • Figure 3: Overview of the construction pipeline for our CoPESD dataset, involving four key steps: video extraction, motion enrichment, bounding box annotation, and data aggregation.
  • Figure 4: (a) Distribution of all images regarding the collection of surgical information. GB indicates Gastric Body. (b) Number of images within each surgeme type. (c) Distribution of entities across the top 20 navigating motion primitive types.
  • Figure 5: Demonstrations of the output surgical robot actions from LVLMs after fine-tuning on the proposed CoPESD dataset.
  • ...and 3 more figures