Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models

Jinhwan Seo, Yoonki Cho, Junhyug Noh, Sung-eui Yoon

TL;DR

The paper addresses Grounded Video Question Answering (GVQA), which requires not only answering questions about video content but also grounding the answers visually and tracking the referenced objects over time. It introduces a three-stage pipeline (Video Reasoning & QA, Spatio-temporal Grounding, and Tracking) in which a Trigger Moment, derived from a Chain-of-Reasoning prompt (CORTEX), serves as the anchor frame that initializes grounding. Key contributions include the Trigger Moment mechanism, OCR-adapted grounding, sliding-window and reverse-play fallbacks, and bidirectional tracking with SAM2. The method achieves a HOTA of $0.4968$ on the GVQA test set, well above the previous year's winning score of $0.2704$. The results indicate that using the QA model's reasoning to select a precise anchor frame improves grounding accuracy and temporal consistency, offering a practical zero-shot solution for grounded multimodal reasoning with potential extension to multi-event queries.

Abstract

In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, which marks a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.

Paper Structure

This paper contains 10 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: An overview of the proposed approach. The QA stage uses a CORTEX prompt with Gemini [comanici2025gemini] to produce fine-grained reasoning and identify a trigger moment, an optimal frame for localization. In the Grounding stage, Molmo [deitke2025molmo] localizes the object within this trigger moment to generate an initial bounding box. If grounding fails at the trigger moment, the model employs fallback strategies, such as a sliding window and reverse-play search, to find an alternative frame to initialize the grounding. Finally, SAM2 [ravi2024sam] produces the full trajectory by starting from the grounded point and applying bidirectional tracking.
  • Figure 2: OCR failure cases.
  • Figure 3: Comparison of temporal grounding (Left: Naive Sliding Window, Right: Trigger Moment).
  • Figure 4: Failure cases.
  • Figure 5: Complete template of Chain-of-Reasoning for Trigger-moment Extraction (CORTEX) prompt. CORTEX guides the MLLM to identify target objects, explain its reasoning process, and specify the optimal temporal moment for spatial grounding.
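
The anchor-selection logic described in the Figure 1 caption (try the trigger moment first, fall back to a sliding-window search and then a reverse-play search, and finally track in both directions from the anchor) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stride `window`, the order in which fallback frames are visited, and the `ground_fn` callable (a stand-in for a grounding model such as Molmo) are all assumptions made for illustration, and the tracking step only lists the frame order a tracker such as SAM2 might be fed.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical type: a grounded location, e.g. an (x, y) point in a frame.
Location = Tuple[float, float]


def candidate_frames(trigger: int, num_frames: int, window: int = 8) -> Iterator[int]:
    """Yield frame indices to try, in priority order (assumed ordering).

    1. The trigger moment itself.
    2. Sliding-window fallback: a forward scan at a fixed stride.
    3. Reverse-play fallback: a backward scan from the last frame.
    """
    seen = {trigger}
    yield trigger
    for idx in range(0, num_frames, window):        # forward scan
        if idx not in seen:
            seen.add(idx)
            yield idx
    for idx in range(num_frames - 1, -1, -window):  # backward scan
        if idx not in seen:
            seen.add(idx)
            yield idx


def find_anchor(
    trigger: int,
    num_frames: int,
    ground_fn: Callable[[int], Optional[Location]],
    window: int = 8,
) -> Optional[Tuple[int, Location]]:
    """Return the first (frame, location) where grounding succeeds, else None.

    `ground_fn` is a hypothetical callable that returns None when the
    grounding model fails to localize the object in the given frame.
    """
    for idx in candidate_frames(trigger, num_frames, window):
        location = ground_fn(idx)
        if location is not None:
            return idx, location
    return None


def bidirectional_order(anchor: int, num_frames: int) -> Tuple[List[int], List[int]]:
    """Frame orders for tracking forward and backward from the anchor frame."""
    forward = list(range(anchor, num_frames))
    backward = list(range(anchor, -1, -1))
    return forward, backward
```

In this sketch the trigger moment is always tried first, so the fallbacks add cost only when grounding at the trigger moment fails. The two lists returned by `bidirectional_order` both start at the anchor frame, mirroring the idea of propagating a track in both temporal directions from a single well-grounded frame.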