EndoDINO: A Foundation Model for GI Endoscopy

Patrick Dermyer; Angad Kalra; Matt Schwartz

TL;DR

EndoDINO addresses the need for a GI endoscopy foundation model that generalizes across diverse tasks without task-specific training of the backbone. ViT backbones are pre-trained with self-supervised learning (DINOv2) on a large curated image dataset derived from the largest known GI endoscopy video collection, then evaluated as frozen backbones with simple decoder heads on anatomical landmark classification, polyp segmentation, and Mayo endoscopic scoring. The results show state-of-the-art performance across these tasks and robust generalization to independently collected data, with notable few-shot capability and reduced labeling requirements. The work highlights practical benefits for real-time GI AI systems and proposes scaling the data further and exploring emergent capabilities for precision medicine.

Abstract

In this work, we present EndoDINO, a foundation model for GI endoscopy tasks that achieves strong generalizability by pre-training on a well-curated image dataset sampled from the largest known GI endoscopy video dataset in the literature. Specifically, we pre-trained ViT models with 1B, 307M, and 86M parameters using datasets ranging from 100K to 10M curated images. Using EndoDINO as a frozen feature encoder, we achieved state-of-the-art performance in anatomical landmark classification, polyp segmentation, and Mayo endoscopic scoring (MES) for ulcerative colitis with only simple decoder heads.
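The frozen-encoder setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `frozen_encoder` is a hypothetical stand-in for a pre-trained backbone (here it only computes two summary statistics), the "images" are synthetic lists of floats, and the head is a plain logistic-regression classifier. The point is only that the encoder is never updated while a simple head is trained on its outputs.

```python
import math
import random


def frozen_encoder(image):
    """Hypothetical stand-in for a frozen pre-trained backbone.

    Maps an 'image' (a flat list of floats) to a fixed feature vector.
    It has no trainable parameters, so training the head cannot change it.
    """
    mean = sum(image) / len(image)
    var = sum((p - mean) ** 2 for p in image) / len(image)
    return [mean, var]


def train_linear_head(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on fixed features by gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0


# Synthetic two-class data, purely illustrative (not endoscopy images).
rng = random.Random(0)
images = [[rng.gauss(0.0, 0.1) for _ in range(16)] for _ in range(20)]
images += [[rng.gauss(1.0, 0.1) for _ in range(16)] for _ in range(20)]
labels = [0] * 20 + [1] * 20

# Features are computed once; only the head's weights are trained.
features = [frozen_encoder(img) for img in images]
w, b = train_linear_head(features, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(features, labels)) / len(labels)
```

Because the features are computed once and reused, only the head's few parameters are fit for each task, which is consistent with the reduced labeling requirements noted in the TL;DR.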

Paper Structure (19 sections, 3 figures, 5 tables)

Introduction
Related Work
Natural Images
Radiology
Pathology
GI Endoscopy
Contributions
Data and Methods
Data Curation
Pre-training
Evaluations
HyperKvasir
LIMUC
Experiments
Anatomical Landmark Classification
...and 4 more sections

Figures (3)

Figure 1: Total DINOv2 training loss in blue, compared to performance on downstream LIMUC 4 class MES task in red, by training step. Peak performance on this downstream task occurs long after total DINOv2 loss is minimized, highlighting the importance of selecting model checkpoints based on downstream task performance.
Figure 2: Representation of classes in the LIMUC dataset showing significant class imbalance that favors lower MES scores.
Figure 3: Representative examples of KvasirSEG polyp images alongside ground truth masks and our predictions for EndoDINO ViT-L/14 with DPT head and EndoDINO ViT-g/14 with Boosted Linear head models.
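
The checkpoint-selection point made in the Figure 1 caption can be sketched as follows. This is a hedged illustration: the record layout (`step`, `loss`, `downstream_score`) and every number in `history` are made up for the example and are not values from the paper.

```python
def min_loss_step(history):
    """Return the training step with the lowest pre-training loss."""
    return min(history, key=lambda rec: rec["loss"])["step"]


def best_downstream_step(history, metric="downstream_score"):
    """Return the training step with the highest downstream metric."""
    return max(history, key=lambda rec: rec[metric])["step"]


# Illustrative values only: the loss bottoms out early, while the
# downstream score peaks at a later step.
history = [
    {"step": 10_000, "loss": 9.1, "downstream_score": 0.60},
    {"step": 20_000, "loss": 8.4, "downstream_score": 0.66},
    {"step": 30_000, "loss": 8.6, "downstream_score": 0.70},
    {"step": 40_000, "loss": 8.9, "downstream_score": 0.68},
]

best_by_loss = min_loss_step(history)
best_by_task = best_downstream_step(history)
```

In this toy history the two criteria choose different checkpoints, which is the situation the caption describes: selecting by minimum loss would discard the checkpoint that performs best downstream.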
