SurgAtt-Tracker: Online Surgical Attention Tracking via Temporal Proposal Reranking and Motion-Aware Refinement

Rulin Zhou; Guankun Wang; An Wang; Yujie Ma; Lixin Ouyang; Bolin Cui; Junyan Li; Chaowei Zhu; Mingyang Li; Ming Chen; Xiaopin Zhong; Peng Lu; Jiankun Wang; Xianming Liu; Hongliang Ren

TL;DR

SurgAtt-Tracker is a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level reranking and motion-aware refinement rather than direct regression, enabling continuous and interpretable frame-wise FoV guidance.

Abstract

Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatmap, enabling continuous and interpretable frame-wise FoV guidance. We propose SurgAtt-Tracker, a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level reranking and motion-aware refinement, rather than direct regression. To support systematic training and evaluation, we introduce SurgAtt-1.16M, a large-scale benchmark with a clinically grounded annotation protocol that enables comprehensive heatmap-based attention analysis across procedures and institutions. Extensive experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker consistently achieves state-of-the-art performance and strong robustness under occlusion, multi-instrument interference, and cross-domain settings. Beyond attention tracking, our approach provides a frame-wise FoV guidance signal that can directly support downstream robotic FoV planning and automatic camera control.
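
To make the idea of proposal-level reranking with temporal coherence concrete, here is a minimal sketch. It is not the paper's AS-Rerank module, which is learned; it substitutes a hand-written rule that blends each proposal's detector score with its IoU against the previous frame's selected box. The function names `iou` and `rerank`, the `(box, score)` proposal format, and the blend weight `alpha` are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def rerank(proposals, prev_box, alpha=0.5):
    """Pick one box from a list of (box, detector_score) proposals.

    Hypothetical rule: score = (1 - alpha) * detector_score
                               + alpha * IoU(box, prev_box).
    With no previous box (first frame), fall back to the detector score.
    """
    if prev_box is None:
        return max(proposals, key=lambda p: p[1])[0]

    def blended(p):
        box, score = p
        return (1.0 - alpha) * score + alpha * iou(box, prev_box)

    return max(proposals, key=blended)[0]


# Usage: the detector slightly prefers the first box, but the second one
# overlaps the previous frame's attention region and is selected instead.
proposals = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
first_frame_choice = rerank(proposals, prev_box=None)
next_frame_choice = rerank(proposals, prev_box=(51, 51, 61, 61))
```

The point of the sketch is only that selecting among candidate regions lets temporal context override a marginally higher per-frame score, which a direct per-frame regression cannot do.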

Paper Structure (59 sections, 24 equations, 11 figures, 7 tables, 1 algorithm)

Introduction
Surgical Attention Tracking: Task and Dataset
Task Formulation
Data Construction: The SurgAtt-SZPH Subset
Annotation Protocol: From Discrete Intent to Continuous Attention
Benchmark Composition
Method
Overview
Architecture Details
Frozen Detector & Proposal Generation
Multi-Scale ROI Decoder
Attention Score Rerank Module
Motion-Aware Adaptive Refine Module
Training Objectives
Reranking losses
...and 44 more sections

Figures (11)

Figure 1: SurgAtt-Tracker enables AI-guided endoscope control by predicting an attention heatmap from raw endoscopic video.
Figure 2: Overview of the SurgAtt-1.16M dataset, illustrating its anatomical coverage, data sources, and unified organization across organs, procedures, and annotation types.
Figure 3: Overview of SurgAtt-Tracker. A frozen detector produces Top-$K$ proposals and multi-scale pyramid features, which are converted into box-aligned embeddings by the Multi-Scale ROI Decoder (B); the AS-Rerank module performs temporal proposal reranking to select the Top-1 attention region (C), and MAA-Refine further refines it using motion-aware geometry (D) and visual evidence to yield the attention heatmap $H_t$.
Figure 4: Qualitative comparison of attention heatmaps across diverse surgical scenarios: (A) single-instrument case; (B) multi-instrument without tissue interaction; (C) multi-instrument with tissue interaction; (D) multi-instrument with smoke interference.
Figure 5: Construction pipeline of the SurgAtt-SZPH dataset. Raw laparoscopic videos are curated into high-quality surgical clips via optical-flow–based operation analysis and expert screening. Videos are sampled at 25 fps and grouped into five representative surgical scenes. During annotation, surgeons mark attention regions with bounding boxes, which are converted into continuous attention heatmaps for supervision. The resulting dataset provides dense, high-fidelity attention annotations across diverse surgical scenarios.
...and 6 more figures
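
The Figure 5 caption mentions that surgeon-marked bounding boxes are converted into continuous attention heatmaps. The exact conversion is not given in this excerpt; one common choice, shown below purely as an assumption, is an axis-aligned 2D Gaussian centred on the box with spread proportional to the box size. The function name `box_to_heatmap` and the parameter `sigma_scale` are hypothetical.

```python
import math


def box_to_heatmap(box, width, height, sigma_scale=0.5):
    """Render a box (x1, y1, x2, y2) as a height x width Gaussian heatmap.

    Assumed convention: the Gaussian is centred on the box centre, and its
    standard deviations are sigma_scale times the box half-width and
    half-height. Values lie in (0, 1] and are largest nearest the centre.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Guard against zero-size boxes to avoid division by zero.
    sx = max(sigma_scale * (x2 - x1) / 2.0, 1e-6)
    sy = max(sigma_scale * (y2 - y1) / 2.0, 1e-6)

    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            d2 = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
            row.append(math.exp(-0.5 * d2))
        heatmap.append(row)
    return heatmap


# Usage: a 4x4 box centred at pixel (4, 4) in a 9x9 frame.
hm = box_to_heatmap((2, 2, 6, 6), width=9, height=9)
```

A soft target like this gives nonzero supervision around the annotated region instead of a hard box edge, which is one plausible reading of "continuous" in the caption.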
