MATIS: Masked-Attention Transformers for Surgical Instrument Segmentation

Nicolás Ayobi, Alejandra Pérez-Rondón, Santiago Rodríguez, Pablo Arbeláez

TL;DR

The paper tackles surgical instrument segmentation in video by introducing MATIS, a two-stage, fully transformer-based method that combines a masked-attention region-proposal baseline with a temporal consistency module built on video transformers. The approach uses Mask2Former to generate localized region proposals and a TAPIR-inspired, MViT-based module to fuse temporal context and refine the classification of each mask. It reports state-of-the-art performance on the EndoVis 2017 and EndoVis 2018 benchmarks, with additional gains from temporal information and ablation analyses supporting the design choices. The method produces high-quality segmentation masks and improves instrument identification, which suggests practical potential for scene understanding in robot-assisted surgery; code and pretrained models are publicly available. Overall, MATIS sets a new state of the art for instrument segmentation by combining spatially precise mask proposals with long-range temporal reasoning.

Abstract

We propose Masked-Attention Transformers for Surgical Instrument Segmentation (MATIS), a two-stage, fully transformer-based method that leverages modern pixel-wise attention mechanisms for instrument segmentation. MATIS exploits the instance-level nature of the task by employing a masked attention module that generates and classifies a set of fine instrument region proposals. Our method incorporates long-term video-level information through video transformers to improve temporal consistency and enhance mask classification. We validate our approach on the two standard public benchmarks, EndoVis 2017 and EndoVis 2018. Our experiments demonstrate that MATIS' per-frame baseline outperforms previous state-of-the-art methods and that including our temporal consistency module boosts our model's performance further.
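
The abstract does not spell out how the masked attention module works. As a rough illustration only, the NumPy sketch below follows the general masked-attention idea from Mask2Former: each query attends only to the pixel locations that its previous mask prediction marks as foreground. The function name, the tensor shapes, the 0.5 threshold, the scaling by the square root of the feature dimension, and the fallback to full attention for empty masks are assumptions made for this example, not details taken from MATIS.

```python
import numpy as np

def masked_attention(queries, keys, values, mask_logits, threshold=0.5):
    """Single-head cross-attention restricted to each query's predicted mask.

    queries:     (Q, D) one embedding per region query
    keys/values: (P, D) one embedding per pixel location
    mask_logits: (Q, P) mask prediction of each query from the previous layer
    All names and shapes are hypothetical and chosen for illustration.
    """
    # Binarise the previous mask prediction (sigmoid, then threshold).
    foreground = 1.0 / (1.0 + np.exp(-mask_logits)) > threshold      # (Q, P)
    # Safeguard assumed in this sketch: a query with an empty mask
    # attends everywhere, so its softmax stays well defined.
    empty = ~foreground.any(axis=1, keepdims=True)
    foreground = foreground | empty

    scores = queries @ keys.T / np.sqrt(queries.shape[1])            # (Q, P)
    scores = np.where(foreground, scores, -np.inf)  # block background pixels
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

# Toy example: query 0 has a mask over the first three pixels, query 1 has none.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4))
k = rng.standard_normal((6, 4))
v = rng.standard_normal((6, 4))
mask_logits = np.full((2, 6), -5.0)
mask_logits[0, :3] = 5.0
out, attn = masked_attention(q, k, v, mask_logits)
```

In this toy run, query 0 places zero attention weight on the last three pixels, while query 1 falls back to attending over all six.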

Paper Structure (7 sections, 2 figures, 4 tables)

This paper contains 7 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Qualitative comparisons of MATIS with previous methods in the EndoVis 2017 and EndoVis 2018 datasets.
  • Figure 2: MATIS first leverages Mask2Former's meta-architecture (top) to compute a set of region proposals and their corresponding segment embeddings. MATIS' temporal consistency module (bottom) computes a sequence of spatio-temporal features that are pooled through time and linearly transformed with an MLP. The result is concatenated ($\bigodot$) with a linear transformation of the per-segment embeddings. Finally, a linear classifier predicts the final class for each region.
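
The data flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: mean pooling over time, a two-layer MLP with a ReLU, random untrained weights, and all dimensions and function names (`init_params`, `classify_regions`) are hypothetical choices made for the example.

```python
import numpy as np

def init_params(d_video, d_seg, d_hidden, n_classes, seed=0):
    """Random, untrained weights for the sketch (all sizes are hypothetical)."""
    rng = np.random.default_rng(seed)

    def weight(d_in, d_out):
        return rng.standard_normal((d_in, d_out)) * 0.02

    return {
        "mlp_w1": weight(d_video, d_hidden), "mlp_b1": np.zeros(d_hidden),
        "mlp_w2": weight(d_hidden, d_hidden), "mlp_b2": np.zeros(d_hidden),
        "seg_w": weight(d_seg, d_hidden), "seg_b": np.zeros(d_hidden),
        "cls_w": weight(2 * d_hidden, n_classes), "cls_b": np.zeros(n_classes),
    }

def classify_regions(video_feats, segment_embs, p):
    """video_feats: (T, d_video) features, one per time step.
    segment_embs: (R, d_seg) embeddings, one per region proposal.
    Returns class logits of shape (R, n_classes)."""
    # Pool the temporal sequence (mean pooling is an assumption).
    pooled = video_feats.mean(axis=0)                              # (d_video,)
    # MLP on the pooled feature (two layers with ReLU assumed).
    hidden = np.maximum(pooled @ p["mlp_w1"] + p["mlp_b1"], 0.0)
    temporal = hidden @ p["mlp_w2"] + p["mlp_b2"]                  # (d_hidden,)
    # Linear transformation of the per-segment embeddings.
    segments = segment_embs @ p["seg_w"] + p["seg_b"]              # (R, d_hidden)
    # Concatenate the shared temporal feature with every region's embedding.
    tiled = np.broadcast_to(temporal, segments.shape)
    fused = np.concatenate([tiled, segments], axis=1)              # (R, 2*d_hidden)
    # Linear classifier over the fused features.
    return fused @ p["cls_w"] + p["cls_b"]

params = init_params(d_video=16, d_seg=8, d_hidden=4, n_classes=7)
rng = np.random.default_rng(1)
logits = classify_regions(rng.standard_normal((5, 16)),
                          rng.standard_normal((3, 8)), params)
print(logits.shape)  # (3, 7)
```

Each of the three region proposals receives its own row of class logits, while the pooled temporal feature is shared by all regions in the frame.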