Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models
Jinhwan Seo, Yoonki Cho, Junhyug Noh, Sung-eui Yoon
TL;DR
The paper addresses Grounded Video Question Answering (GVQA), which requires not only answering questions about video content but also grounding the answers visually and tracking the referenced objects over time. It introduces a three-stage approach (Video Reasoning & QA, Spatio-temporal Grounding, and Tracking) in which a Trigger Moment, derived from a Chain-of-Reasoning prompt (CORTEX), serves as the anchor that initializes grounding. Key contributions include the Trigger Moment mechanism, OCR-adapted grounding, sliding-window and reverse-play fallbacks, and bidirectional tracking with SAM2. The method achieves a HOTA score of $0.4968$ on the GVQA test set, well above the previous year's winning score of $0.2704$. The results indicate that using the QA model's reasoning to select a precise anchor frame improves grounding accuracy and temporal consistency, offering a practical zero-shot solution for grounded multimodal reasoning with potential extension to multi-event queries.
Abstract
In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
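The control flow of the three-stage pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation: `QAResult`, `run_pipeline`, and the three callables (`reason_and_answer`, `ground`, `track_step`) are hypothetical placeholders for the QA model, the grounding model, and the tracker, and the `(x1, y1, x2, y2)` box format is an assumption. The sketch only shows how a trigger frame can anchor grounding and then seed tracking in both temporal directions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class QAResult:
    """Hypothetical output of stage 1 (Video Reasoning & QA)."""
    answer: str          # answer to the question
    target: str          # text description of the object to ground
    trigger_frame: int   # index of the frame where the target is most visible


def run_pipeline(
    frames: List[object],
    question: str,
    reason_and_answer: Callable[[List[object], str], QAResult],
    ground: Callable[[object, str], Box],
    track_step: Callable[[object, Box], Box],
) -> Tuple[str, Dict[int, Box]]:
    """Answer a question and return one box per frame for the target object."""
    if not frames:
        raise ValueError("frames must not be empty")

    # Stage 1: reason over the video, answer, and pick the trigger frame.
    qa = reason_and_answer(frames, question)
    t = max(0, min(qa.trigger_frame, len(frames) - 1))  # clamp to a valid index

    # Stage 2: ground the target on the trigger frame only.
    boxes: Dict[int, Box] = {t: ground(frames[t], qa.target)}

    # Stage 3: propagate the anchor box forward, then backward in time.
    for i in range(t + 1, len(frames)):
        boxes[i] = track_step(frames[i], boxes[i - 1])
    for i in range(t - 1, -1, -1):
        boxes[i] = track_step(frames[i], boxes[i + 1])

    return qa.answer, dict(sorted(boxes.items()))
```

In this sketch each stage is injected as a callable, so any QA model, grounding model, or tracker can be substituted without changing the control flow. The fallback strategies mentioned in the TL;DR (sliding window, reverse play) are not modeled here.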
