Agentic Spatio-Temporal Grounding via Collaborative Reasoning

Heng Zhao, Yew-Soon Ong, Joey Tianyi Zhou

TL;DR

ASTG tackles Spatio-Temporal Video Grounding in zero-shot, training-free settings by decomposing the grounding task into collaborative Spatial and Temporal Reasoning Agents. The Spatial Reasoning Agent proposes candidate tubes on selected frames, while the Temporal Reasoning Agent validates and temporally localizes using visual prompts within a propose-and-evaluate loop, aided by a Candidate Memory and a Controller that manage memory and dialogue. This agentic, tool-augmented approach enables open-world generalization and reduces redundant frame-wise reasoning, achieving strong results on VidSTG and HC-STVG benchmarks that rival some fully supervised methods. The work demonstrates the practicality of autonomous, memory-guided reasoning for robust, zero-shot STVG in unconstrained video data.

Abstract

Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization. Weakly-supervised variants reduce annotation costs but remain constrained by the dataset-level train-and-fit paradigm and deliver inferior performance. To address these challenges, we propose the Agentic Spatio-Temporal Grounder (ASTG), a framework for STVG in an open-world, training-free setting. Specifically, two specialized agents, the Spatial Reasoning Agent (SRA) and the Temporal Reasoning Agent (TRA), both built on modern Multimodal Large Language Models (MLLMs), work collaboratively to retrieve the target tube in an autonomous and self-guided manner. Following a propose-and-evaluate paradigm, ASTG decouples spatial and temporal reasoning and automates tube extraction, verification, and temporal localization. A dedicated visual memory and dialogue context significantly improve retrieval efficiency. Experiments on popular benchmarks show that the proposed approach outperforms existing weakly-supervised and zero-shot approaches by a margin and is comparable to some fully-supervised methods.
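The propose-and-evaluate paradigm described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Candidate`, `CandidateMemory`, and `ground`, the callback signatures, and the rule of skipping candidates by track ID are all assumptions made for illustration. The two agents are stubbed as plain functions; in ASTG they would be MLLM-backed and tool-augmented.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Candidate:
    """Hypothetical candidate tube: a tracked object and the frames it was seen on."""
    track_id: int
    frames: List[int]


@dataclass
class CandidateMemory:
    """Illustrative memory that records candidates already rejected."""
    rejected: Set[int] = field(default_factory=set)

    def is_new(self, cand: Candidate) -> bool:
        return cand.track_id not in self.rejected

    def reject(self, cand: Candidate) -> None:
        self.rejected.add(cand.track_id)


# Assumed agent interfaces (stand-ins for the SRA and TRA).
Propose = Callable[[str, int], List[Candidate]]                # (query, frame index) -> candidates
Evaluate = Callable[[str, Candidate], Optional[Tuple[int, int]]]  # -> (start, end) or None


def ground(query: str, keyframes: List[int], propose: Propose,
           evaluate: Evaluate) -> Optional[Tuple[Candidate, Tuple[int, int]]]:
    """Propose candidates on selected frames and evaluate each one at most once."""
    memory = CandidateMemory()
    for frame_idx in keyframes:
        for cand in propose(query, frame_idx):
            if not memory.is_new(cand):
                continue  # already rejected earlier, so skip the costly evaluation
            span = evaluate(query, cand)
            if span is not None:
                return cand, span  # accepted: candidate tube plus temporal span
            memory.reject(cand)
    return None  # no candidate was accepted


# Toy stand-ins for the two agents, used only to exercise the loop.
calls: List[int] = []


def toy_propose(query: str, frame_idx: int) -> List[Candidate]:
    tracks = {0: [1, 2], 10: [2, 3]}
    return [Candidate(t, [frame_idx]) for t in tracks.get(frame_idx, [])]


def toy_evaluate(query: str, cand: Candidate) -> Optional[Tuple[int, int]]:
    calls.append(cand.track_id)
    return (8, 20) if cand.track_id == 3 else None


result = ground("the man who stands up", [0, 10], toy_propose, toy_evaluate)
```

In this toy run, track 2 is proposed on both frames but evaluated only once, which is the kind of redundancy a candidate memory is meant to avoid.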

Paper Structure (14 sections, 5 equations, 3 figures, 3 tables, 1 algorithm)

This paper contains 14 sections, 5 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: (a) Static spatial reasoning ignores dynamic temporal context. (b) Temporal reasoning with holistic video input cannot infer spatial information in a dense manner. (c) Joint-reasoning systems repeat spatial reasoning on each frame, leading to redundancy. (d) ASTG formulates STVG as a collaborative agentic process, where spatial candidates are proposed once and iteratively verified by the Spatial and Temporal Reasoning Agents (SRA and TRA), leveraging tools such as a candidate memory, an object tracker, and a visual marker.
  • Figure 2: We propose the first Agentic Spatio-Temporal Grounding (ASTG) framework. Two agent modules, the Spatial Reasoning Agent (SRA) and the Temporal Reasoning Agent (TRA), work collaboratively to search for and retrieve the correct target candidate tube $\mathcal{T}$. Spatial prompts (red mask outlines) and temporal prompts (frame indices, e.g. "#10") are drawn on the frames ($V_{sp}$ and $V_{stp}$) to assist the agents in the reasoning process. A visual memory module $\mathcal{M}$ and a dialogue context module are also proposed to improve retrieval efficiency and guide the agents' behavior adaptively.
  • Figure 3: Qualitative examples of the proposed ASTG method on HC-STVG v2.