SCISSR: Scribble-Conditioned Interactive Surgical Segmentation and Refinement

Haonan Ping, Jian Jiang, Cheng Yuan, Qizhen Sun, Lv Wu, Yutong Ban

Abstract

Accurate segmentation of tissues and instruments in surgical scenes is annotation-intensive due to irregular shapes, thin structures, specularities, and frequent occlusions. While SAM models support point, box, and mask prompts, points are often too sparse and boxes too coarse to localize such challenging targets. We present SCISSR, a scribble-promptable framework for interactive surgical scene segmentation. It introduces a lightweight Scribble Encoder that converts freehand scribbles into dense prompt embeddings compatible with the mask decoder, enabling iterative refinement for a target object by drawing corrective strokes on error regions. Because all added modules (the Scribble Encoder, Spatial Gated Fusion, and LoRA adapters) interact with the backbone only through its standard embedding interfaces, the framework is not tied to a single model: we build on SAM 2 in this work, yet the same components transfer to other prompt-driven segmentation architectures such as SAM 3 without structural modification. To preserve pre-trained capabilities, we train only these lightweight additions while keeping the remaining backbone frozen. Experiments on EndoVis 2018 demonstrate strong in-domain performance, while evaluation on the out-of-distribution CholecSeg8k further confirms robustness across surgical domains. SCISSR achieves 95.41% Dice on EndoVis 2018 with five interaction rounds and 96.30% Dice on CholecSeg8k with three interaction rounds, outperforming iterative point prompting on both benchmarks.
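The refinement loop described above depends on two quantities: the overlap score reported in the results (Dice) and the error regions where a user would draw corrective strokes. The following is a minimal sketch of both for binary masks, using the standard Dice and IoU definitions. The function names and the split into missed and spurious regions are illustrative assumptions, not the authors' evaluation code, and the abstract does not specify how scribbles are simulated.

```python
import numpy as np


def dice_and_iou(pred, target, eps=1e-7):
    """Standard Dice and IoU between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    pred_area = pred.sum()
    target_area = target.sum()
    union = pred_area + target_area - inter
    # eps keeps both scores defined (and equal to 1) when both masks are empty.
    dice = (2.0 * inter + eps) / (pred_area + target_area + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)


def error_regions(pred, target):
    """Hypothetical helper: where a corrective stroke could be drawn.

    Returns (missed, spurious): pixels of the target that the prediction
    missed, and predicted pixels that lie outside the target.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    missed = np.logical_and(target, np.logical_not(pred))
    spurious = np.logical_and(pred, np.logical_not(target))
    return missed, spurious
```

In an interactive setting, a positive stroke would plausibly be placed inside `missed` and a negative stroke inside `spurious`, after which the scores are recomputed for the next round.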

Paper Structure

This paper contains 22 sections, 3 equations, 9 figures, 17 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Motivation for scribble prompts. Points provide sparse cues and boxes enclose large background regions, while scribbles outline the target with dense spatial coverage.
  • Figure 2: Overview of SCISSR. Track 1 encodes all accumulated scribbles as dense prompt embeddings for the mask decoder; Track 2 encodes only the latest correction and injects it into the Memory Attention query via Spatial Gated Fusion. The mask is iteratively refined across rounds ($R_0\rightarrow R_1\rightarrow R_2\rightarrow \cdots$). Blue: frozen SAM 2 components; green/yellow: trainable components.
  • Figure 3: Per-class IoU with incremental refinement gains. (a) EndoVis 2018 (contour, $R_0\rightarrow R_4$). (b) CholecSeg8k (OOD, $R_0\rightarrow R_2$). Numbers above each bar give the final IoU and the total gain.
  • Figure 4: Qualitative visualization of feature changes from $R_0\rightarrow R_1$: the SGF-induced query modification ($|F_q{-}F_{img}|$) and the memory-induced update ($|F_{mem}{-}F_{img}|$), shown alongside the input scribbles and the refined $R_1$ prediction.
  • Figure 5: Scribble strategy ablation on EndoVis 2018. The contour strategy achieves the best mIoU and mDice at both $R_0$ and $R_4$.
  • ...and 4 more figures
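
The captions for Figures 2 and 4 refer to a gated injection of scribble features into the query and to change maps such as $|F_q - F_{img}|$. The sketch below illustrates one way such quantities could be computed. The gate form (a per-pixel sigmoid over concatenated features), the residual update, and the channel-averaged change map are all assumptions made for illustration; this listing does not give the actual Spatial Gated Fusion design.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(f_img, f_scr, w_gate, b_gate):
    """Hypothetical spatial gate (not the paper's definition).

    f_img, f_scr: feature maps of shape (C, H, W).
    w_gate: weights of shape (2C,), acting like a 1x1 convolution.
    b_gate: scalar bias.
    Returns the fused query features F_q of shape (C, H, W).
    """
    stacked = np.concatenate([f_img, f_scr], axis=0)  # (2C, H, W)
    # Contract the channel axis to get one gate value per pixel.
    gate = sigmoid(np.tensordot(w_gate, stacked, axes=1) + b_gate)  # (H, W)
    # Assumed residual form: image features plus gated scribble features.
    return f_img + gate[None, :, :] * f_scr


def change_map(f_a, f_b):
    """Per-pixel change magnitude: mean absolute difference over channels."""
    return np.abs(f_a - f_b).mean(axis=0)
```

With zero gate weights and zero bias the gate is 0.5 everywhere, so `change_map(gated_fusion(...), f_img)` is non-zero only where the scribble features are non-zero, which matches the kind of localized map the Figure 4 caption describes.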