Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

Yizhou Li, Shuyuan Yang, Jiaji Su, Zonghe Chua

TL;DR

The paper examines how the source of gaze supervision, which varies by expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), affects learning-based surgical gaze perception in robot-assisted minimally invasive surgery (RMIS). It introduces a paired active-passive gaze dataset collected on the da Vinci SimNow simulator across four drills, enabling controlled same-video comparison of gaze during active execution and passive observation. The authors compare two frame-wise saliency models, MSI-Net and SalGAN, trained on four dataset variants: intermediate active (IA), intermediate passive (IP), novice active (NA), and novice passive (NP). MSI-Net yields stable, interpretable predictions, whereas SalGAN is unstable and often poorly aligned with human fixations. Models trained on passive gaze recover a substantial portion of intermediate active attention, but with predictable degradation and asymmetric transfer between active and passive targets. The results support data-centric strategies for gaze supervision and suggest that crowd-sourced passive gaze can serve as a scalable proxy for operative gaze in surgical coaching and perception modeling, albeit with modality-specific biases.

Abstract

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.
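
To make the idea of a fixation density overlap analysis concrete, the following is a minimal sketch, not the authors' implementation. It assumes fixations are given as pixel coordinates, builds a density map by summing an isotropic Gaussian at each fixation, and compares two maps with histogram intersection (SIM) and Pearson correlation (CC), two metrics commonly used in saliency evaluation. The grid size, `sigma`, the function names, and the choice of metrics are all illustrative assumptions; this excerpt does not state which overlap metrics the paper uses.

```python
import math

def fixation_density(fixations, width, height, sigma=2.0):
    """Sum an isotropic Gaussian at each (x, y) fixation, then normalize to sum to 1.

    sigma (in pixels) is a hypothetical smoothing width, not a value from the paper.
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(map(sum, grid))
    if total == 0:
        raise ValueError("density map is empty: no fixations fall near the grid")
    return [[v / total for v in row] for row in grid]

def similarity(p, q):
    """Histogram intersection (SIM) of two normalized maps: 1.0 when identical."""
    return sum(min(a, b) for rp, rq in zip(p, q) for a, b in zip(rp, rq))

def pearson_cc(p, q):
    """Pearson correlation coefficient (CC) between two maps, flattened to vectors."""
    a = [v for row in p for v in row]
    b = [v for row in q for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Toy example: made-up fixations for one frame of the same video.
active = fixation_density([(8, 8), (9, 8)], width=32, height=24)
passive = fixation_density([(10, 9), (20, 15)], width=32, height=24)

print("SIM(active, passive):", similarity(active, passive))
print("CC(active, passive):", pearson_cc(active, passive))
```

Because both maps are computed on the same frame, the comparison mirrors the same-video pairing described in the abstract: any difference in overlap reflects where the operator and the observer looked, not a difference in stimulus.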

Paper Structure (29 sections, 14 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: (A) Active gaze collection experimental setup in which video output from the da Vinci surgeon console is piped into a VR headset with native gaze-tracking functionality. da Vinci Surgical Simulation SimNow tasks: (B) Sea Spikes, (C) Ring Rollercoaster 1, (D) Knot-tying, and (E) Big Dipper Needle Driver.
  • Figure 2: Gaze metrics for active and passive demonstrations. (A) Fixation rate, (B) scanpath speed, (C) fixation ratio, (D) convex hull. All metrics are time-normalized over the duration of each demonstration.
  • Figure 3: Gaze overlap metrics across observer–source skill combinations. Bars indicate mean values and error bars denote standard deviation.
  • Figure 4: Selected heatmaps for ground truth (A) NA – novice active gaze, (B) NP – novice passive gaze, and (C) IP – intermediate passive gaze, and for MSI-Net predictions from models trained on (D) NA, (E) NP, (F) IA, and (G) IP.