Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation

Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch

TL;DR

This work tackles occlusion and long-term identity preservation in surgical instrument segmentation with ReMeDI-SAM3, a training-free extension of SAM3. It introduces a dual-memory design (relevance-aware and occlusion-aware), memory expansion via piecewise interpolation of temporal encodings, and a feature-based re-identification module with temporal voting that corrects identities after disocclusion. In a zero-shot setting on EndoVis17 and EndoVis18, it raises mcIoU by around 7 and 16 percentage points, respectively, over vanilla SAM3 and reduces false positives, without retraining. The result is more robust instrument tracking in challenging endoscopic videos, which could support intraoperative guidance without domain-specific fine-tuning.
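The temporal-voting idea behind the re-identification module can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `cosine` and `vote_identity`, the use of one reference vector per instrument, and the cosine-similarity matching are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vote_identity(window_features, feature_bank):
    """Pick an identity for a reappeared instrument by majority vote.

    window_features: feature vectors from a short window of frames after
                     disocclusion (hypothetical input format).
    feature_bank:    dict mapping identity -> reference vector (assumed to be
                     a single vector per identity here).
    Returns (winning identity, fraction of frames that voted for it).
    """
    votes = Counter()
    for feat in window_features:
        # Each frame votes for the identity whose reference is most similar.
        best = max(feature_bank, key=lambda k: cosine(feat, feature_bank[k]))
        votes[best] += 1
    winner, count = votes.most_common(1)[0]
    return winner, count / len(window_features)


bank = {"grasper": [1.0, 0.0], "scissors": [0.0, 1.0]}
window = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.3]]
winner, support = vote_identity(window, bank)
print(winner, round(support, 2))
```

Voting over several frames means that a single noisy frame (the second one in this toy window) does not flip the assigned identity.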

Abstract

Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
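The dual-memory idea in component (i) can be sketched as follows. This is a toy illustration under stated assumptions, not the method's actual code: the class name `DualMemory`, the confidence threshold of 0.8, the capacity of 2, and the rule of taking the most recent low-confidence frames before the occlusion are all hypothetical choices.

```python
from collections import deque


class DualMemory:
    """Toy per-instrument memory with a relevance-aware and an occlusion-aware part."""

    def __init__(self, threshold=0.8, capacity=2):
        self.threshold = threshold                # hypothetical confidence cut-off
        self.capacity = capacity                  # hypothetical size of each memory
        self.relevance = deque(maxlen=capacity)   # high-confidence frames only
        self.occlusion = deque(maxlen=capacity)   # filled when the tool reappears
        self.buffer = []                          # keeps every past frame

    def observe(self, frame_idx, confidence):
        """Record a frame; only confident frames enter the relevance-aware memory."""
        self.buffer.append((frame_idx, confidence))
        if confidence >= self.threshold:
            self.relevance.append(frame_idx)

    def on_disocclusion(self, occlusion_start):
        """Fill the occlusion-aware memory with lower-confidence pre-occlusion frames."""
        candidates = [
            idx for idx, conf in self.buffer
            if idx < occlusion_start and conf < self.threshold
        ]
        self.occlusion.clear()
        self.occlusion.extend(candidates[-self.capacity:])


memory = DualMemory()
for idx, conf in enumerate([0.9, 0.95, 0.6, 0.5, 0.4]):
    memory.observe(idx, conf)
memory.on_disocclusion(occlusion_start=5)
print(list(memory.relevance), list(memory.occlusion))
```

Keeping the two memories separate means low-confidence frames never displace reliable entries, yet the frames closest to the occlusion remain available when the instrument reappears.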

Paper Structure

This paper contains 25 sections, 6 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: SAM3 vs ReMeDI-SAM3 (ours). The orange-labeled instrument gets occluded after $T=153$ and re-appears at $T=165$. While SAM3 produces false positives after its re-appearance, ReMeDI-SAM3 maintains consistent identities across occlusion and re-entry.
  • Figure 2: ReMeDI-SAM3 pipeline. We extend SAM3 with a dual-memory design and a feature-based re-identification module. For each instrument, the memory is divided into a relevance-aware memory that stores high-confidence entries and an occlusion-aware memory that is populated upon disocclusion using lower-confidence pre-occlusion frames drawn from an Unconditional Buffer that stores all past frames (\ref{sec:memories}). When disocclusion is detected (the tool reappears), the occlusion-aware memory is updated first, after which the feature-based ReID module verifies or reassigns the predicted identity using a multi-scale feature bank (\ref{sec:re-id}).
  • Figure 3: Visualization of temporal positional encodings and memory expansion strategies. Left: selected channels of the original temporal positional embeddings. Middle: uniform interpolation distributes new positions evenly over the entire temporal range. Right: piecewise interpolation preserves the boundary embeddings and samples new positions only in the interior region.
  • Figure 4: Qualitative comparison of SAM3 and ReMeDI-SAM3. ReMeDI-SAM3 preserves identity across long occlusions and re-entries, while SAM3 shows identity confusion.
  • Figure 5: Qualitative comparison of SAM3 and ReMeDI-SAM3 on EndoVis17.
  • ...and 4 more figures
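The contrast between the two expansion strategies described in the Figure 3 caption can be sketched in plain Python. This is one possible reading of the caption, not the paper's implementation: the function names, the use of linear interpolation, the `keep` parameter (how many boundary embeddings are preserved at each end), and the toy one-channel table are all assumptions.

```python
def linspace(start, stop, n):
    """n evenly spaced values from start to stop inclusive."""
    if n == 1:
        return [float(start)]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def lerp_row(table, pos):
    """Linearly interpolate between the two rows of `table` around fractional index `pos`."""
    lo = min(int(pos), len(table) - 1)
    hi = min(lo + 1, len(table) - 1)
    w = pos - lo
    return [(1 - w) * a + w * b for a, b in zip(table[lo], table[hi])]


def uniform_expand(table, new_len):
    """Resample the whole table evenly over its full temporal range."""
    return [lerp_row(table, p) for p in linspace(0, len(table) - 1, new_len)]


def piecewise_expand(table, new_len, keep=1):
    """Copy `keep` boundary rows at each end; interpolate only in the interior."""
    n = len(table)
    interior = new_len - 2 * keep
    if interior < 1 or n <= 2 * keep:
        raise ValueError("table or target length too small for this `keep`")
    head = [list(row) for row in table[:keep]]
    tail = [list(row) for row in table[n - keep:]]
    middle = [lerp_row(table, p) for p in linspace(keep, n - 1 - keep, interior)]
    return head + middle + tail


# Toy table: 4 temporal positions, 1 channel each.
table = [[0.0], [1.0], [2.0], [3.0]]
print(uniform_expand(table, 7))
print(piecewise_expand(table, 6, keep=1))
```

In this sketch the first and last rows of the piecewise result are exact copies of the original boundary embeddings, while all newly created positions fall between the interior rows.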