Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning
Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
TL;DR
Know-Show is a unified benchmark for spatio-temporal grounded reasoning in video-language models, combining action-level reasoning with precise spatial and temporal grounding across five scenarios. The benchmark reveals that state-of-the-art Video-LMs underperform when reasoning and grounding are required jointly. The work also introduces GRAM, a training-free augmentation that selects relevant video evidence at each reasoning step and encodes timing with explicit timestamp tokens. Across open and closed models, GRAM yields consistent gains, particularly in spatial grounding. Hand-object and other fine-grained interactions remain challenging, which points to the need for stronger spatial and temporal supervision and better relational modeling. The work also provides a detailed implementation and qualitative analyses, positioning Know-Show as a practical standard for interpretable and reliable multimodal reasoning in real-world video understanding.
Abstract
Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning: the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (including Qwen, VideoLLaVA, GPT-4o, and Gemini) reveal that existing models struggle to "show what they know" and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the dataset and the code at https://github.com/LUNAProject22/Know-Show.
