Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning

Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando

TL;DR

Know-Show presents a unified benchmark for spatio-temporal grounded reasoning in video-language models, combining action-level reasoning with precise spatial and temporal grounding across five scenarios. It reveals that state-of-the-art Video-LMs underperform on joint reasoning and grounding, and introduces GRAM, a training-free augmentation that selects relevant video evidence at each reasoning step and augments timing with explicit timestamp tokens. Across open and closed models, GRAM yields consistent gains, particularly in spatial grounding, but hand-object and fine-grained interactions remain challenging and indicate the need for stronger spatial supervision, relational modeling, and temporal supervision. The work also provides a detailed implementation and qualitative analyses, establishing Know-Show as a practical standard for interpretable and reliable multimodal reasoning in real-world video understanding.

Abstract

Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning: the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (e.g., Qwen, VideoLLaVA, GPT-4o, and Gemini) reveal that existing models struggle to "show what they know" and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the dataset and the code at https://github.com/LUNAProject22/Know-Show.

Paper Structure

This paper contains 23 sections, 5 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Challenges of current Video-Language Models in spatio-temporal grounded reasoning. Although models can capture the coarse temporal order of actions or recognize objects, they often fail to correctly associate actions, humans, and objects with their precise spatial and temporal contexts. Unlike humans, who can both reason and ground, most models cannot do both. To address this, we propose the Know-Show benchmark, designed to assess genuine spatio-temporal grounded reasoning, a crucial capability for real-world applications such as robotics and assistive AI.
  • Figure 2: Overview of the test scenarios in the Know-Show Benchmark. The benchmark consists of two main categories: (1) Action-Conditioned Spatial Grounded Reasoning, which evaluates a model’s ability to reason about and localize people, objects, and hands within specific action contexts. This category includes four subtypes: (a) person-grounded reasoning, (b) object-grounded reasoning, (c) person–object co-grounded reasoning, and (d) hand–object co-grounded reasoning. (2) Action-Conditioned Temporal Grounded Reasoning, which assesses the model’s ability to reason about temporal order and to localize actions in time. Scenarios (a)–(d) are derived from Video 1, while scenario (e), corresponding to category (2), is derived from Video 2.
  • Figure 3: Illustration of the decoding process in our GRAM plugin. The inputs consist of a video and a corresponding question. The video is first processed by the Vision Encoder to produce video tokens, which are combined with text tokens and previously generated tokens to form the input sequence. During decoding, whenever the model encounters a token that marks the beginning of a new reasoning step, we aggregate the attention of that token across all VLM layers and heads by averaging. From this aggregated attention map, we select the top $N$ most attended video tokens. These selected tokens are then concatenated with the Decoder’s input embeddings and fed into the Decoder to generate the next token. This iterative process ensures that each reasoning step is spatially and temporally grounded in the video, facilitating Spatio-Temporal Grounded Reasoning.
  • Figure 4: Prompts used to evaluate the Video-LMs.
  • Figure 5: Qualitative results for Action-Conditioned Person-Grounded Reasoning.
  • ...and 5 more figures
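
The token-selection step described in the Figure 3 caption (average the step-start token's attention over all layers and heads, then keep the top $N$ video tokens) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `aggregate_attention` and `select_top_video_tokens`, the nested-list layout `attn[layer][head][position]`, and the choice to return the selected positions in their original order are all hypothetical. How the selected tokens are then concatenated with the Decoder's input embeddings is not specified here and is left out.

```python
from typing import List, Sequence


def aggregate_attention(
    attn: Sequence[Sequence[Sequence[float]]],
) -> List[float]:
    """Average attention weights over every (layer, head) map.

    attn[layer][head][k] is assumed to hold the attention weight from the
    token that starts a reasoning step to key position k.
    """
    num_maps = sum(len(layer) for layer in attn)
    seq_len = len(attn[0][0])
    totals = [0.0] * seq_len
    for layer in attn:
        for head in layer:
            for k, weight in enumerate(head):
                totals[k] += weight
    return [total / num_maps for total in totals]


def select_top_video_tokens(
    attn: Sequence[Sequence[Sequence[float]]],
    video_positions: Sequence[int],
    top_n: int,
) -> List[int]:
    """Return the positions of the top_n most attended video tokens.

    Only positions listed in video_positions are candidates, so text tokens
    are never selected. The result is sorted by position (an assumption made
    here so the selected tokens keep their original order).
    """
    mean_attn = aggregate_attention(attn)
    ranked = sorted(video_positions, key=lambda p: mean_attn[p], reverse=True)
    return sorted(ranked[:top_n])


if __name__ == "__main__":
    # Toy input: 2 layers x 2 heads, 6 key positions (0-3 video, 4-5 text).
    toy_attn = [
        [[0.1, 0.4, 0.1, 0.2, 0.1, 0.1], [0.0, 0.5, 0.1, 0.2, 0.1, 0.1]],
        [[0.1, 0.3, 0.0, 0.4, 0.1, 0.1], [0.2, 0.2, 0.0, 0.4, 0.1, 0.1]],
    ]
    print(select_top_video_tokens(toy_attn, video_positions=[0, 1, 2, 3], top_n=2))
```

In a real Video-LM the attention maps would come from the model's own forward pass at each step-start token, and the returned positions would index into the video token embeddings before they are appended to the Decoder input.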