LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings

Chengan Che, Chao Wang, Tom Vercauteren, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

Abstract

Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos and less than 30 hours of footage, which leads to poor model generalization. To address this data limitation, a new dataset called LEMON has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos totaling 938 hours (85 million frames) of high-quality footage across multiple procedure types, LEMON offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel downstream tasks. To demonstrate the effectiveness of this diverse dataset, we introduce LemonFM, a foundation model pretrained on LEMON using a novel self-supervised augmented knowledge distillation approach. LemonFM consistently outperforms existing surgical foundation models across four downstream tasks and six datasets, achieving significant gains in surgical phase recognition (+9.5pp, +9.4pp, and +8.4pp in Jaccard on AutoLaparo, M2CAI16, and Cholec80), surgical action recognition (+4.4pp in mAP on CholecT50), surgical tool presence detection (+5.3pp and +10.2pp in mAP on Cholec80 and GraSP), and surgical semantic segmentation (+10.3pp in mDice on CholecSeg8k). LEMON and LemonFM will serve as foundational resources for the research community and industry, accelerating progress in developing autonomous robotic surgery systems and ultimately contributing to safer and more accessible surgical care worldwide. Dataset, code, and models are publicly available at https://github.com/visurg-ai/LEMON.

Paper Structure

This paper contains 31 sections, 2 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Performance comparison between LemonFM, our foundation model pretrained on LEMON, and other surgical foundation models. Data points are plotted relative to the numeric axis ticks. The full results are available in Tables \ref{tab:finetuned_classification} and \ref{tab:segmentation}.
  • Figure 2: Data curation pipeline. (a) Surgical videos are collected from YouTube. (b) The videos are summarized as storyboards (trivial to annotate and classify). We train a storyboard classifier to identify videos with rich surgical content, and manually verify the selected videos. (c) During the video selection and trimming phase, a trained frame classifier categorizes each frame as either surgical or non-surgical, allowing us to trim non-surgical frame segments at the beginning and end of each video. We further filter videos by requiring at least 90% of their frames to be surgical frames. (d) In the video preprocessing stage, we obliterate non-surgical frames and non-surgical regions within surgical frames. (e) During the video annotation phase, we utilize the video title as a primary cue to determine the procedure and surgery type. In cases where the video titles do not explicitly include any of our procedure name keywords, we employ ChatGPT to match the titles with surgical procedure types. Both the final videos and their corresponding labels are manually quality controlled.
  • Figure 3: Comparison between LEMON and other surgical datasets sourced from the web. Video segments from each dataset illustrate the differences in data curation (left). Procedures that are covered by neither GenSurgery nor SurgeNetXL, nor by any other public dataset (right). Best viewed online.
  • Figure 4: Proposed augmented distillation method. To encourage invariance in LemonFM to minor surgical motion and subtle appearance changes across patients, we introduce $W_i$ (Eq. \ref{eq:loss}). The key component of $W_i$ is a pair of images (c) that are randomly selected from an augmentation pool (b). To populate the augmentation pool, which has capacity for four images, we first retrieve the nearest neighbors$^{(1)}$ of the input image from other videos of the same procedure type, but only include them in the pool if the cosine distance to the input image is smaller than $3\times$ the distance between the input image and its preceding frame in the video (the choice of $3\times$ factor and the cosine distance is justified with ablation experiments in the supplementary material). When not enough suitable neighbors are found, we supplement the pool with adjacent video frames$^{(2)(3)}$.
  • Figure 5: Diversity and procedure prevalence in LEMON. Representative samples from various procedures, demonstrating the diverse range of cases in our curated dataset (left, right). Distribution of surgical frames by procedure type (center).
  • ...and 1 more figure
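
Two of the rules described in the captions above are concrete enough to sketch in code: the trimming and 90% filtering step of Figure 2(c), and the augmentation-pool construction of Figure 4. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. Frames are assumed to be represented by a per-frame surgical/non-surgical flag (Figure 2) or by a non-zero embedding vector (Figure 4). The 90% ratio is assumed to be measured on the trimmed segment, and the adjacent frames used to top up the pool are assumed to be supplied by the caller in priority order.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def trim_and_filter(is_surgical: Sequence[bool],
                    min_ratio: float = 0.9) -> Optional[Tuple[int, int]]:
    """Trim leading/trailing non-surgical frames, then apply the ratio filter.

    Returns the kept range as (start, end) with `end` exclusive, or None if
    the video is rejected. Assumption: the ratio is computed on the trimmed
    segment rather than on the full video.
    """
    surgical_idx = [i for i, flag in enumerate(is_surgical) if flag]
    if not surgical_idx:
        return None  # no surgical content at all
    start, end = surgical_idx[0], surgical_idx[-1] + 1
    ratio = len(surgical_idx) / (end - start)
    return (start, end) if ratio >= min_ratio else None


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_augmentation_pool(query: Vector,
                            previous_frame: Vector,
                            neighbours: Sequence[Vector],
                            adjacent_frames: Sequence[Vector],
                            capacity: int = 4,
                            factor: float = 3.0) -> List[Vector]:
    """Fill a pool of at most `capacity` items for one input image.

    `neighbours` are candidate nearest neighbours from other videos of the
    same procedure type. A neighbour is accepted only if its cosine distance
    to `query` is smaller than `factor` times the distance between `query`
    and its preceding frame. Remaining slots are filled with adjacent frames.
    """
    threshold = factor * cosine_distance(query, previous_frame)
    ranked = sorted(neighbours, key=lambda v: cosine_distance(query, v))
    pool = [v for v in ranked if cosine_distance(query, v) < threshold]
    pool = pool[:capacity]
    for frame in adjacent_frames:
        if len(pool) >= capacity:
            break
        pool.append(frame)
    return pool


def pick_pair(pool: Sequence[Vector], seed: Optional[int] = None) -> List[Vector]:
    """Randomly select the pair of pool items used to form W_i."""
    return random.Random(seed).sample(list(pool), 2)
```

In this sketch the acceptance threshold adapts to each input image: a frame taken during fast motion differs more from its predecessor, so it tolerates more distant neighbours, whereas a frame from a nearly static scene admits only very close ones.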