Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding

Jinghan Zhao, Yifei Huang, Feng Lu

TL;DR

This work tackles the semantic gap between abstract procedural descriptions and visual data in instructional videos by introducing the Task-Step-State (TSS) framework, which grounds procedures in observable object configurations through a state layer. It then employs a progressive pre-training curriculum that unfolds the hierarchy in the order Task → Step → State → Step → Task, using LLM-generated state descriptions and cross-modal pseudo-labels to supervise a lightweight adapter over frozen visual features. Extensive experiments on COIN and CrossTask demonstrate consistent improvements in task recognition, step recognition, and next-step forecasting, with ablations confirming the essential roles of state supervision and staged training. The approach offers a principled pathway to learn robust, procedural-aware video representations with strong potential across embodied AI and instructional understanding tasks.

Abstract

Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by aligning visual content with textual descriptions at the task and step levels to inject procedural semantics into video representations. However, due to their high level of abstraction, 'task' and 'step' descriptions fail to form a robust alignment with the concrete, observable details in visual data. To address this, we introduce 'states', i.e., textual snapshots of object configurations, as a visually-grounded semantic layer that anchors abstract procedures to what a model can actually see. We formalize this insight in a novel Task-Step-State (TSS) framework, where tasks are achieved via steps that drive transitions between observable states. To enforce this structure, we propose a progressive pre-training strategy that unfolds the TSS hierarchy, forcing the model to ground representations in states while associating them with steps and high-level tasks. Extensive experiments on the COIN and CrossTask datasets show that our method outperforms baseline models on multiple downstream tasks, including task recognition, step recognition, and next-step prediction. Ablation studies show that introducing state supervision is a key driver of performance gains across all tasks. Additionally, our progressive pre-training strategy proves more effective than standard joint training, as it better enforces the intended hierarchical structure.

Paper Structure

This paper contains 32 sections, 3 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: (a) Our method extends the conventional task-step hierarchy by adding a state layer, forming the Task-Step-State (TSS) framework, in which states anchor abstract procedures to visual evidence. (b) To unfold the structure in TSS, we introduce a progressive pre-training strategy that proceeds in the order task→step→state→step→task.
  • Figure 2: The TSS framework extends the task-step hierarchy with an LLM-generated state layer.
  • Figure 3: Overview. (a) First, an LLM generates state descriptions from the task and step in WikiHow, yielding the three-level TSS framework. We encode these texts with a frozen text encoder $\phi_{\text{pre\_t}}$ and encode video clips from HowTo100M with a frozen vision encoder $\phi_{\text{pre\_v}}$, producing text features $e_t$ and video features $e_v$. Cosine similarity between $e_t$ and $e_v$ creates pseudo labels at each level. (b) Second, the progressive pre-training strategy trains the vision model in five stages, each focused on one TSS layer. Every stage takes $e_v$, passes it through an adapter $\phi_2^{(i)}$ and a task-specific head $\sigma$, and is supervised by the corresponding pseudo label.
  • Figure 4: Comparison of ablation results. The value at position $(i, j)$ in the heat map denotes the fraction of the 12 evaluation settings in which configuration $i$ outperforms configuration $j$.
  • Figure 5: Different ways to fuse the three visual representations after adapter fine-tuning in the ablation study.
  • ...and 7 more figures
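
The Figure 3 caption states that cosine similarity between text features $e_t$ and video features $e_v$ creates pseudo labels at each TSS level. The following is a minimal sketch of that idea, not the authors' implementation. The caption does not say how similarities are turned into labels, so the top-k selection, the function name `cosine_pseudo_labels`, and the toy feature values are all assumptions made for illustration.

```python
import numpy as np

def cosine_pseudo_labels(e_v, e_t, top_k=1):
    """Hypothetical helper: match each video clip to its most similar texts.

    e_v: (num_clips, d) video features from a frozen vision encoder.
    e_t: (num_texts, d) text features from a frozen text encoder.
    Returns the full cosine-similarity matrix and, per clip, the indices
    of the top_k most similar texts (an assumed labelling rule).
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    v = e_v / np.linalg.norm(e_v, axis=1, keepdims=True)
    t = e_t / np.linalg.norm(e_t, axis=1, keepdims=True)
    sim = v @ t.T  # shape: (num_clips, num_texts)
    # Sort each row by descending similarity and keep the first top_k indices.
    top = np.argsort(-sim, axis=1)[:, :top_k]
    return sim, top

# Toy features (made up): 2 clips and 3 candidate texts in a 2-D space.
e_v = np.array([[1.0, 0.0], [0.0, 2.0]])
e_t = np.array([[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]])
sim, labels = cosine_pseudo_labels(e_v, e_t)
print(labels)
```

In this toy case the first clip is matched to the second text and the second clip to the first text. The same routine could be applied separately to task, step, and state descriptions to obtain one set of pseudo labels per level.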
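
The heat map described in the Figure 4 caption can be read as a pairwise win-fraction matrix: entry $(i, j)$ is the share of evaluation settings in which configuration $i$ scores higher than configuration $j$. The sketch below shows one way such a matrix could be computed. The function name `pairwise_win_fraction`, the rule that ties count as "does not outperform", and the scores themselves are assumptions; the example uses 4 made-up settings rather than the paper's 12.

```python
import numpy as np

def pairwise_win_fraction(scores):
    """Hypothetical helper: fraction of settings where config i beats config j.

    scores: (num_configs, num_settings) array, higher is better.
    Returns a (num_configs, num_configs) matrix; ties count as no win.
    """
    scores = np.asarray(scores, dtype=float)
    # Broadcast to compare every pair of configurations on every setting.
    wins = scores[:, None, :] > scores[None, :, :]
    # Average over the settings axis to get the win fraction.
    return wins.mean(axis=2)

# Made-up scores for 3 configurations on 4 evaluation settings.
scores = np.array([
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.7, 0.6, 0.9],
    [0.3, 0.3, 0.3, 0.3],
])
heat = pairwise_win_fraction(scores)
print(heat)
```

Because wins are strict, the diagonal is zero, and `heat[i, j] + heat[j, i]` falls below 1 whenever two configurations tie on some setting.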