Towards Object-centric Understanding for Instructional Videos

Wenliang Guo, Yu Kong

TL;DR

This work advocates an object-centric approach to understanding procedural tasks in instructional videos, proposing Object-IVQA to benchmark fine-grained object-state reasoning with temporally grounded evidence. It introduces a modular, multi-agent framework that plans tool use, processes video, analyzes object states, and generates grounded natural-language answers, enabling multi-hop reasoning across disjoint segments. Empirical results show current LVLMs struggle with object-level recognition and temporal causality, while the proposed agent framework significantly improves answer quality and evidence localization. The work lays groundwork for future enhancements like self-supervised object-state representations and symbolic reasoning scaffolds to advance robust object-centric procedural understanding.

Abstract

Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning: state evolution, precondition verification, counterfactual reasoning, and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis, and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle with object-level recognition and reasoning, whereas our framework achieves substantial improvement.

Paper Structure

This paper contains 19 sections, 1 equation, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Overview of our object-centric video QA task, where the model must track object dynamics over time and produce both an answer and its supporting time spans as temporal evidence.
  • Figure 2: Overview of the data collection pipeline, which combines automatic video sampling and QA generation using LVLMs with human refinement. This keeps annotation efficient and high-quality while reducing the cost of purely manual annotation.
  • Figure 3: QA types in our benchmark dataset. Unlike existing instructional datasets, which focus on action analysis, our benchmark centers on object-level understanding to capture temporal and spatial dynamics, enabling benchmarking from diverse perspectives.
  • Figure 4: Dataset statistics of our Object-IVQA benchmark dataset.
  • Figure 5: Our agent framework decomposes video QA into planning, processing, analyzing, and generation agents, along with an example showing the generated tool-usage plan, intermediate results, and final answer.