SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention

Shreyas C. Dhake, Jiayuan Huang, Runlong He, Danyal Z. Khan, Evangelos B. Mazomenos, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarak I. Hoque

TL;DR

This work tackles anticipatory reasoning in surgical VQA by introducing PitVQA-Anticipation, a large-scale, expert-annotated dataset designed for forward-looking questions about future phases, steps, instruments, and remaining time in endonasal pituitary surgery. It further proposes SurgAnt-ViVQA, a multimodal model that combines a temporally aware video encoder (bidirectional GRU) with a GRU-Gated Cross-Attention mechanism and a LoRA-fine-tuned GPT-2 backbone to fuse visual context with language at the token level. Across PitVQA-Anticipation and EndoVis18-VQA benchmarks, SurgAnt-ViVQA achieves state-of-the-art results, with ablations showing temporal recurrence and adaptive gating as key drivers of improvement and a frame-budget trade-off between fluency and numeric timing accuracy. By coupling temporal dynamics with fine-grained cross-modal fusion, the approach advances surgical VQA from retrospective scene description to proactive, clinically actionable anticipation, laying groundwork for real-time predictive assistance in constrained surgical workflows.

Abstract

Anticipating forthcoming surgical events is vital for real-time assistance in endonasal transsphenoidal pituitary surgery, where visibility is limited and workflow changes rapidly. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment, providing little support for forecasting next steps or instrument needs. Existing surgical VQA datasets likewise center on the current scene rather than the near future. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning. It comprises 33.5 hours of operative video and 734,769 question-answer pairs built from temporally grouped clips and expert annotations across four tasks: predicting the future phase, next step, upcoming instrument, and remaining duration. We further propose SurgAnt-ViVQA, a video-language model that adapts a large language model using a GRU-Gated Temporal Cross-Attention module. A bidirectional GRU encodes frame-to-frame dynamics, while an adaptive gate injects visual context into the language stream at the token level. Parameter-efficient fine-tuning customizes the language backbone to the surgical domain. SurgAnt-ViVQA is evaluated on the PitVQA-Anticipation and EndoVis datasets, where it surpasses strong image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains. A frame-budget study indicates a trade-off: 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation. By pairing a temporally aware encoder with fine-grained gated cross-attention, SurgAnt-ViVQA advances surgical VQA from retrospective description to proactive anticipation. PitVQA-Anticipation offers a comprehensive benchmark for this setting and highlights the importance of targeted temporal modeling for reliable, future-aware surgical assistance.
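
To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch of a bidirectional GRU feeding a gated cross-attention layer. It is an illustration under stated assumptions, not the authors' implementation: the class name GRUGatedCrossAttention, the sigmoid gate over concatenated text and attended features, the residual-plus-LayerNorm arrangement, and the 512-dimensional frame features are all hypothetical choices. The text width of 768 matches the hidden size of the smallest GPT-2 variant, which may differ from the configuration used in the paper.

```python
import torch
import torch.nn as nn


class GRUGatedCrossAttention(nn.Module):
    """Illustrative sketch: text tokens attend to BiGRU-encoded frame features.

    The gating formula and all dimensions are assumptions, not taken from the paper.
    """

    def __init__(self, frame_dim=512, text_dim=768, num_heads=8):
        super().__init__()
        # Bidirectional GRU over per-frame features. Each direction outputs
        # text_dim // 2 so the concatenated output matches the text width.
        self.temporal = nn.GRU(
            frame_dim, text_dim // 2, batch_first=True, bidirectional=True
        )
        # Text tokens are the queries; temporally encoded frames are keys/values.
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        # Per-token gate computed from each token and its attended visual context.
        self.gate = nn.Linear(2 * text_dim, text_dim)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_emb, frame_feats):
        # text_emb: (batch, tokens, text_dim)
        # frame_feats: (batch, frames, frame_dim)
        video_ctx, _ = self.temporal(frame_feats)  # (batch, frames, text_dim)
        attended, _ = self.cross_attn(text_emb, video_ctx, video_ctx)
        # Gate values in (0, 1) scale how much visual context each token receives.
        g = torch.sigmoid(self.gate(torch.cat([text_emb, attended], dim=-1)))
        return self.norm(text_emb + g * attended)


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = GRUGatedCrossAttention()
    text = torch.randn(2, 12, 768)    # 2 questions, 12 tokens each
    frames = torch.randn(2, 8, 512)   # 8-frame clips, as in the frame-budget study
    fused = layer(text, frames)
    print(fused.shape)  # torch.Size([2, 12, 768])
```

Because the fused output has the same shape as the token embeddings, a layer of this kind could in principle sit in front of a language backbone without altering its interface, and the number of frames (for example 8 or 32) can vary without changing the layer.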

Paper Structure

This paper contains 14 sections, 5 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Sample question–answer (QA) pairs from the PitVQA-Anticipation dataset, showing anticipatory queries about upcoming surgical phases, steps, instruments, and time durations.
  • Figure 2: The SurgAnt-ViVQA architecture processes text through the LLM’s tokenizer and embeddings, while the GRU-Gated Cross-Attention module fuses video features before passing the combined representation to the LLM, which is adapted via parameter-efficient fine-tuning (LoRA; Hu et al., 2022).
  • Figure 3: Two example questions from the PitVQA-Anticipation dataset with predictions from different models. Red text denotes wrong predictions, and <PAD> marks degenerate generated answers.