
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

Qi'ao Xu, Tianwen Qian, Yuqian Fu, Kailing Li, Yang Jiao, Jiacheng Zhang, Xiaoling Wang, Liang He

TL;DR

ToG-Bench introduces Task-oriented Spatio-Temporal Grounding in egocentric videos (T-STVG) and provides a dedicated benchmark to evaluate how well models localize task-relevant objects under goal-driven instructions. It uses a semi-automated, top-down annotation pipeline combining vision-language foundation models and human verification to create 100 ScanNet clips with 2,704 task instructions and 4,194 object grounding tubes. The benchmark emphasizes explicit-implicit dual grounding and one-to-many object associations, and it introduces task-level metrics that jointly assess recognition and spatio-temporal localization. Experiments with seven state-of-the-art MLLMs reveal substantial gaps in implicit reasoning and multi-object grounding, motivating further advances to bridge perception and action in embodied AI.

Abstract

A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) Task-oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit-Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) One-to-Many Grounding, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: https://github.com/qaxuDev/ToG-Bench.
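
To make the three features concrete, the following is a minimal sketch of how one annotated instruction might be represented in memory. It is not the released data format: the class names, field names, box layout, and the choice to store the explicit/implicit flag per object rather than per task are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundingTube:
    """One object's track through the clip (hypothetical schema)."""
    label: str       # object category, e.g. "kettle"
    explicit: bool   # True if the instruction names the object directly
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box

    def span(self) -> Tuple[int, int]:
        """First and last annotated frame of the tube."""
        frames = sorted(self.boxes)
        return frames[0], frames[-1]


@dataclass
class TaskInstruction:
    """A task-oriented instruction with one or more target tubes."""
    clip_id: str
    text: str
    tubes: List[GroundingTube] = field(default_factory=list)

    def is_multi_object(self) -> bool:
        # One-to-many grounding: several objects serve a single instruction.
        return len(self.tubes) > 1

    def has_implicit_target(self) -> bool:
        # Implicit grounding: at least one target must be inferred from context.
        return any(not tube.explicit for tube in self.tubes)
```

Under this sketch, an instruction such as "make a cup of tea" (an invented example) would carry two tubes, a kettle and a mug, both flagged as implicit because neither object is named in the text.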

Paper Structure

This paper contains 38 sections, 21 figures, 8 tables.

Figures (21)

  • Figure 1: Illustration and comparison of existing STVG benchmarks and our ToG-Bench. While prior datasets (a–c) focus on object-centric, explicit, single-object grounding in either exocentric or egocentric videos, ToG-Bench (d) introduces a task-driven paradigm that supports task-oriented instructions, explicit-implicit dual grounding, and multi-object grounding, enabling robust evaluation of embodied agents.
  • Figure 2: Semi-automated Annotation Pipeline. The process consists of three top-down stages: (1) generating task-oriented instructions and corresponding object descriptions with MLLM, (2) grounding and tracking task-relevant objects via Grounding-DINO and SAM2, and (3) performing human verification and filtering.
  • Figure 3: Dataset characteristics of ToG-Bench. Top-left: Task distribution by type (explicit vs. implicit) and object count; Top-right: Object category frequency (top 40 categories); Bottom: Example grounding tubes for explicit (blue) and implicit (pink) tasks, highlighting contextual inference and multi-object grounding.
  • Figure 4: Video duration distribution: bars show video count per interval, line shows average duration within each bin.
  • Figure 5: Task-level performance of GPT-5 across video duration bins on ToG-Bench. Left: task accuracy (T-Acc). Middle: temporal grounding (T-m_tIoU). Right: spatial grounding (T-m_vIoU).
  • ...and 16 more figures
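
The temporal and spatial scores named in the Figure 5 caption build on tIoU and vIoU. As a rough illustration, the sketch below computes both for a single predicted tube against a single ground-truth tube, using the definitions common in STVG work: tIoU is the overlap of the two frame sets, and vIoU is the sum of per-frame box IoU over shared frames divided by the size of the frame union. How ToG-Bench aggregates these into its task-level T- variants across multiple objects is not stated in this section, so that step is left out. The function names and the frame-indexed dictionary format are assumptions.

```python
from typing import Dict, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]  # frame index -> box


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def t_iou(pred: Tube, gt: Tube) -> float:
    """Temporal IoU: shared frames divided by all frames in either tube."""
    pred_frames, gt_frames = set(pred), set(gt)
    union = pred_frames | gt_frames
    return len(pred_frames & gt_frames) / len(union) if union else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Spatio-temporal IoU: box IoU summed over shared frames,
    normalized by the number of frames in the union."""
    pred_frames, gt_frames = set(pred), set(gt)
    union = pred_frames | gt_frames
    if not union:
        return 0.0
    shared = pred_frames & gt_frames
    return sum(box_iou(pred[f], gt[f]) for f in shared) / len(union)
```

Because vIoU is normalized by the frame union, a prediction with perfect boxes but a poorly localized time span is still penalized, so vIoU can never exceed tIoU for the same pair of tubes.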