EndoMamba: An Efficient Foundation Model for Endoscopic Videos via Hierarchical Pre-training

Qingyao Tian, Huai Liao, Xinyan Huang, Bingyu Yang, Dongdong Lei, Sebastien Ourselin, Hongbin Liu

TL;DR

EndoMamba addresses two bottlenecks of endoscopic foundation models, slow inference and scarce pre-training data, by pairing a recurrent, memory-efficient Mamba-based backbone with a two-level pre-training scheme. Low-level masked reconstruction teaches the model spatiotemporal structure, while high-level feature alignment transfers broader knowledge from a general-domain video foundation model. The resulting model runs in real time (up to 46.7 FPS) and outperforms existing foundation models and task-specific methods on classification, segmentation, surgical phase recognition, and localization in endoscopy. Hierarchical pre-training compensates for the limited amount of endoscopic data, which makes the model suitable for real-time applications such as intraoperative guidance and automated video analysis.

Abstract

Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their application is hindered by (1) computational inefficiency and (2) suboptimal performance caused by limited pre-training data in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiency, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advances in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference on online video streams. Second, we propose a self-supervised hierarchical pre-training scheme that enhances EndoMamba's representation learning by using endoscopic videos and incorporating general video domain knowledge. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatiotemporal structures and high-level alignment to transfer broader knowledge from a pretrained general-domain video foundation model. Extensive experiments on four downstream tasks (classification, segmentation, surgical phase recognition, and localization) demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code is available at https://github.com/TianCuteQY/EndoMamba.
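
The hierarchical objective described in the abstract pairs a low-level reconstruction term with a high-level alignment term. The following is a minimal sketch of how such a combined loss could be computed, not the authors' implementation. It assumes mean squared error on masked tokens for reconstruction, cosine distance between student and teacher features for alignment, and a hypothetical weighting factor `alpha`; none of these specifics are stated in the abstract, and all function names are illustrative.

```python
import math
from typing import Sequence

Tokens = Sequence[Sequence[float]]  # one feature vector per token


def masked_reconstruction_loss(pred: Tokens, target: Tokens, mask: Sequence[int]) -> float:
    """Mean squared error averaged over masked tokens only (mask value 1).

    Assumption: MSE is used here for illustration; the abstract does not
    name the reconstruction loss.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
            count += 1
    return total / count if count else 0.0


def feature_alignment_loss(student: Tokens, teacher: Tokens, eps: float = 1e-8) -> float:
    """Average cosine distance (1 - cosine similarity) between feature pairs.

    Assumption: cosine distance is one common choice for feature alignment;
    the abstract does not specify the alignment loss.
    """
    total = 0.0
    for s, t in zip(student, teacher):
        dot = sum(si * ti for si, ti in zip(s, t))
        norm = math.sqrt(sum(si * si for si in s)) * math.sqrt(sum(ti * ti for ti in t))
        total += 1.0 - dot / (norm + eps)
    return total / len(student)


def hierarchical_loss(pred: Tokens, target: Tokens, mask: Sequence[int],
                      student: Tokens, teacher: Tokens, alpha: float = 1.0) -> float:
    """Low-level reconstruction plus alpha-weighted high-level alignment.

    `alpha` is a hypothetical hyperparameter introduced for this sketch.
    """
    low = masked_reconstruction_loss(pred, target, mask)
    high = feature_alignment_loss(student, teacher)
    return low + alpha * high
```

In this sketch the teacher features would come from the frozen general-domain video model, so only the student receives gradients from the alignment term.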

Paper Structure

This paper contains 10 sections, 8 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Proposed EndoMamba model for endoscopic video analysis. (a) EndoMamba backbone structure with efficient recurrent inference ability, and (b) hierarchical pre-training scheme for enhanced representation learning.
  • Figure 2: Example frames of the pre-training data in MIX12.
  • Figure 3: Relative performance on classification F1 and segmentation Dice score, with gradual performance gains from pre-training data scaling and the addition of a teacher model.
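
The recurrent inference ability noted in the Figure 1(a) caption can be illustrated with a toy example. The sketch below uses a scalar linear state space recurrence with fixed, hypothetical parameters `a`, `b`, and `c`; it is a deliberate simplification, not the Mamba block used in the paper, whose parameters depend on the input. It shows why a past-to-present model can process a video stream one frame at a time: the streaming update keeps only a single state value yet produces the same outputs as recomputing over the whole history.

```python
from typing import List, Sequence, Tuple


def ssm_step(h: float, x: float, a: float, b: float, c: float) -> Tuple[float, float]:
    """One recurrent update: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h_new = a * h + b * x
    return h_new, c * h_new


def ssm_stream(xs: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Process inputs one at a time, storing only the current state."""
    h = 0.0
    ys = []
    for x in xs:
        h, y = ssm_step(h, x, a, b, c)
        ys.append(y)
    return ys


def ssm_full_history(xs: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Reference computation that revisits all past inputs at every step:
    y_t = c * sum over k <= t of a^(t - k) * b * x_k."""
    return [
        c * sum(a ** (t - k) * b * xs[k] for k in range(t + 1))
        for t in range(len(xs))
    ]
```

The streaming version does constant work per new input, while the full-history version grows with the sequence length, which is the property that makes recurrent backbones attractive for online video.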