Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery

Jecia Z. Y. Mao; Francis X. Creighton; Russell H. Taylor; Manish Sahu

Abstract

We introduce a speech-guided embodied agent framework for video-guided skull-base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument then serves as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based estimation of the surgical tool pose, and image guidance through real-time anatomical overlays. We evaluate the proposed system in video-guided skull-base surgery scenarios and benchmark its tracking performance against a commercially available optical tracking system. Results demonstrate that speech-guided embodied agents can achieve competitive spatial accuracy while improving workflow integration and enabling rapid deployment of video-guided surgical systems.

Paper Structure (20 sections, 57 equations, 14 figures, 3 tables)

Introduction
Related Work
Methodology
Surgical Tool Segmentation
Tip Point Tracking
Anatomy Segmentation
Anatomy Registration
Surgical Navigation
Spatial Pose Tracking
Fore-ground Conditioned Depth Retrieval
Pose Initialization
Cross-frames Tracking
Experiments and Results
Pose Tracking Accuracy
Tool-tip translation error
...and 5 more sections

Figures (14)

Figure 1: System overview of the embodied surgical agent. The surgeon interacts through a hands-free interface (speech-to-text) that issues high-level commands to the front end, which orchestrates live video streaming, tool/anatomy segmentation, pose tracking, and anatomy registration. Intermediate outputs (masks, pose hypotheses, and registered anatomy overlays) are persisted in a streaming memory and can be retrieved on demand to support iterative refinement and rapid task switching without disrupting the surgical workflow. The back end executes modular perception and geometry components (including promptable segmentation, temporal mask propagation, surgical navigation, and pose/registration solvers) to produce stable tool state and anatomy-aligned navigation overlays in the endoscopic view.
Figure 2: Speech-driven tool segmentation with streaming memory and catch-up propagation. A voice command triggers GSAM to generate candidate masks at time $t_0$. After the surgeon confirms a proposal, the selected mask is stored and used to seed propagation through buffered intermediate frames, producing an updated mask at the latest time $t_n$ so online tracking can resume without interrupting tool motion.
Figure 3: Interactive anatomy segmentation and refinement with event-triggered prompt retrieval. A voice command starts anatomy segmentation while the system buffers the drill-tip trajectory. When the surgeon says "Done," the stored trajectory is converted into spatial prompts for SAM to generate an initial mask. Refinement is triggered by a "Click" command over regions to remove, adding prompts for mask update. The final anatomy mask $M^{\mathrm{Anat}}$ seeds CUTIE for temporal propagation. Blue arrows denote segmentation-mode prompting and trajectory retrieval; purple arrows denote refinement interactions and SAM updates.
Figure 4: Interactive anatomy registration workflow using the embodied surgical agent. The surgeon initiates registration via a voice command ("Register the anatomy") and sequentially identifies four anatomical landmarks in the video stream ("landmark 1" through "landmark 4"). The embodied surgical agent records the selected 2D landmarks in the memory module and associates them with corresponding 3D anatomical mesh points. After collecting sufficient correspondences, the system performs pose estimation to compute the transformation between the camera and anatomy frames. The registered anatomy is then overlaid onto the live surgical scene, enabling consistent spatial alignment and visualization during the procedure.
Figure 5: Surgical navigation with registered anatomical overlay. After registration, the segmented anatomical structures are transformed into the camera frame and rendered directly onto the live surgical view. Color-coded regions denote critical structures (e.g., facial nerve, cochlear nerve, vestibular aqueduct), providing spatial context within the operative field. The highlighted region illustrates how the system enhances intraoperative awareness by overlaying anatomy onto the exposed surgical cavity, supporting precise and anatomy-aware tool manipulation.
...and 9 more figures
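
The catch-up propagation described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `CatchUpBuffer`, the `propagate` callback, and the toy `shift_mask` function are hypothetical stand-ins, and the real system would use a learned mask-propagation model on video frames. The sketch only shows the buffering logic: frames keep arriving while the surgeon confirms a mask proposed at time $t_0$, and the confirmed mask is then stepped through every buffered frame to yield a mask at the latest time $t_n$.

```python
from collections import deque
from typing import Callable, Deque, Generic, Tuple, TypeVar

F = TypeVar("F")  # frame type (hypothetical placeholder)
M = TypeVar("M")  # mask type (hypothetical placeholder)


class CatchUpBuffer(Generic[F, M]):
    """Buffers timestamped frames so a mask confirmed late can catch up."""

    def __init__(self, propagate: Callable[[M, F], M], maxlen: int = 512) -> None:
        # propagate(mask, frame) -> mask on that frame; assumed to be
        # supplied by a temporal mask-propagation model.
        self._propagate = propagate
        self._frames: Deque[Tuple[float, F]] = deque(maxlen=maxlen)

    def push(self, t: float, frame: F) -> None:
        """Store an incoming frame with its timestamp."""
        self._frames.append((t, frame))

    def catch_up(self, t0: float, mask_t0: M) -> Tuple[float, M]:
        """Propagate the mask confirmed at t0 through all later buffered frames."""
        t_latest, mask = t0, mask_t0
        for t, frame in self._frames:
            if t <= t0:
                continue  # frames up to the proposal time are already covered
            mask = self._propagate(mask, frame)
            t_latest = t
        self._frames.clear()  # online tracking takes over from here
        return t_latest, mask


def shift_mask(mask, dx):
    """Toy propagation: each 'frame' is a horizontal shift in pixels."""
    return {(row, col + dx) for row, col in mask}


buf = CatchUpBuffer(shift_mask)
for t in range(5):
    buf.push(float(t), 1)  # the tool moves one pixel right per frame

# The mask was proposed at t0 = 1.0 and is confirmed after frame 4.0 arrived.
t_n, mask_n = buf.catch_up(1.0, {(0, 10)})
```

In this toy run the mask is propagated through the three frames that arrived after $t_0$, so tracking resumes from the most recent frame rather than from the stale proposal.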
