Learning Object State Changes in Videos: An Open-World Perspective

Zihui Xue, Kumar Ashutosh, Kristen Grauman

TL;DR

This work introduces VidOSC, an open-world framework for temporally localizing Object State Changes (OSCs) in videos and generalizing to unseen objects. It combines object-agnostic state prediction with temporal modeling and leverages text descriptions and vision-language models as supervision to scale training without exhaustive labeling. The HowToChange dataset provides an unprecedented, long-tail OSC benchmark with 134 objects, 20 state transitions, and a substantial split between known and novel OSCs, enabling robust open-world evaluation. Empirical results show VidOSC surpasses closed-world and open-world baselines, with ablations confirming the benefits of a shared state vocabulary, temporal context, and object-centric features; qualitative analyses further reveal improved temporal coherence and interpretability of object relations. The work highlights the practical potential of text-VLM supervision and open-world design for fine-grained, temporally evolving object states in real-world videos, and points to future exploration of concurrent OSCs and enhanced spatial reasoning.

Abstract

Object State Changes (OSCs) are pivotal for video understanding. While humans can effortlessly generalize OSC understanding from familiar to unknown objects, current approaches are confined to a closed vocabulary. Addressing this gap, we introduce a novel open-world formulation for the video OSC problem. The goal is to temporally localize the three stages of an OSC -- the object's initial state, its transitioning state, and its end state -- whether or not the object has been observed during training. Towards this end, we develop VidOSC, a holistic learning approach that: (1) leverages text and vision-language models for supervisory signals to obviate manually labeling OSC training data, and (2) abstracts fine-grained shared state representations from objects to enhance generalization. Furthermore, we present HowToChange, the first open-world benchmark for video OSC localization, which offers an order of magnitude increase in the label space and annotation volume compared to the best existing benchmark. Experimental results demonstrate the efficacy of our approach, in both traditional closed-world and open-world scenarios.

Paper Structure (27 sections, 1 equation, 16 figures, 9 tables)

Figures (16)

  • Figure 1: Top: The video OSC objective is to temporally localize an object's three states (i.e., initial, transitioning, end). Bottom: OSCs naturally exhibit a long tail. Certain OSCs, such as melting butter or marshmallow, are frequently showcased in instructional videos while others like melting jaggery might be rarely seen. We introduce an innovative open-world formulation that requires extrapolating to novel objects never encountered during training.
  • Figure 2: Our proposed VidOSC framework: (a) Mining for OSC examples (Sec. \ref{sec:pseudo_label}): We leverage ASR transcriptions paired with videos and the capabilities of an LLM to automatically mine OSC examples. (b) Pseudo Label Generation (Sec. \ref{sec:pseudo_label}): We utilize textual state descriptions and a VLM for supervisory signals during training. (c) Model Training (Sec. \ref{sec:model_design}): We develop a video model for object-agnostic state prediction. (d) Model Testing (Sec. \ref{sec:problem_formulation}): We propose an open-world formulation, evaluating on both known and novel OSCs. Notably, while we employ the text modality to guide model training, our model is purely video-based and requires no text input at the test phase, ensuring maximum flexibility and applicability. Ground truth for the test set is manually annotated.
  • Figure 3: Ground truth annotation distribution across 20 state transitions (top) and 134 objects (bottom) in HowToChange (Evaluation). In line with our open-world formulation, annotations cover a diverse range of object-state transition combinations, categorized into known and novel OSCs.
  • Figure 4: Top-1 frame predictions given by VidOSC for the initial, transitioning, and end states, on ChangeIt (open-world) (first 2 rows) and HowToChange (last 2 rows). VidOSC not only accurately localizes the three fine-grained states for known OSCs, but also generalizes this understanding to novel objects, such as cauliflower and capsicum, which are not observed during training.
  • Figure 5: Comparison of model predictions across a test video depicting the OSC of "slicing shallot" on HowToChange. The x-axis represents temporal progression through the video. VidOSC gives temporally smooth and coherent predictions that best align with the ground truth, significantly outperforming baselines in capturing the video's global temporal context.
  • ...and 11 more figures
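
Illustrative sketch (not from the paper): the Figure 2 caption says that textual state descriptions and a VLM supply supervisory signals during training. One way such pseudo-labeling could work is shown below in Python, assuming each frame already has a similarity score against a text description of the initial, transitioning, and end state. The function name pseudo_label_frames, the margin rule, its default value, and the "ambiguous" label are all hypothetical choices made for this example; the paper's actual labeling rule may differ.

```python
from typing import List, Sequence

# The three OSC stages named in the abstract.
STATES = ("initial", "transitioning", "end")
# Hypothetical label for frames the scores cannot separate (an assumption).
AMBIGUOUS = "ambiguous"


def pseudo_label_frames(
    scores: Sequence[Sequence[float]], margin: float = 0.1
) -> List[str]:
    """Assign one pseudo-label per frame from VLM-style similarity scores.

    scores[t] holds three similarities for frame t, ordered as STATES.
    A frame takes the best-scoring state only if it beats the runner-up
    by at least `margin` (an assumed rule); otherwise it is ambiguous.
    """
    labels = []
    for frame_scores in scores:
        if len(frame_scores) != len(STATES):
            raise ValueError("each frame needs exactly three scores")
        # Rank state indices from highest to lowest similarity.
        ranked = sorted(
            range(len(STATES)), key=lambda i: frame_scores[i], reverse=True
        )
        best, runner_up = ranked[0], ranked[1]
        if frame_scores[best] - frame_scores[runner_up] >= margin:
            labels.append(STATES[best])
        else:
            labels.append(AMBIGUOUS)
    return labels


# Made-up scores for four frames of one video.
example_scores = [
    (0.90, 0.30, 0.20),
    (0.50, 0.48, 0.10),
    (0.20, 0.80, 0.30),
    (0.10, 0.30, 0.85),
]
example_labels = pseudo_label_frames(example_scores)
```

Under these assumptions, the second frame is left ambiguous because its top two scores are too close, so it would not contribute a confident training signal. The other three frames each receive one of the three state labels.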