Learning Object State Changes in Videos: An Open-World Perspective
Zihui Xue, Kumar Ashutosh, Kristen Grauman
TL;DR
This work introduces VidOSC, an open-world framework for temporally localizing Object State Changes (OSCs) in videos and generalizing to unseen objects. It combines object-agnostic state prediction with temporal modeling and leverages text descriptions and vision-language models as supervision to scale training without exhaustive labeling. The HowToChange dataset provides an unprecedented, long-tail OSC benchmark with 134 objects, 20 state transitions, and a substantial split between known and novel OSCs, enabling robust open-world evaluation. Empirical results show VidOSC surpasses closed-world and open-world baselines, with ablations confirming the benefits of a shared state vocabulary, temporal context, and object-centric features; qualitative analyses further reveal improved temporal coherence and interpretability of object relations. The work highlights the practical potential of text-VLM supervision and open-world design for fine-grained, temporally evolving object states in real-world videos, and points to future exploration of concurrent OSCs and enhanced spatial reasoning.
Abstract
Object State Changes (OSCs) are pivotal for video understanding. While humans can effortlessly generalize OSC understanding from familiar to unknown objects, current approaches are confined to a closed vocabulary. Addressing this gap, we introduce a novel open-world formulation for the video OSC problem. The goal is to temporally localize the three stages of an OSC -- the object's initial state, its transitioning state, and its end state -- whether or not the object has been observed during training. Towards this end, we develop VidOSC, a holistic learning approach that: (1) leverages text and vision-language models for supervisory signals to obviate manually labeling OSC training data, and (2) abstracts fine-grained shared state representations from objects to enhance generalization. Furthermore, we present HowToChange, the first open-world benchmark for video OSC localization, which offers an order of magnitude increase in the label space and annotation volume compared to the best existing benchmark. Experimental results demonstrate the efficacy of our approach, in both traditional closed-world and open-world scenarios.
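To make the localization target concrete, the sketch below shows one way per-frame state predictions could be turned into temporal segments for the three OSC stages. It is an illustration under stated assumptions, not the VidOSC implementation: the label strings, the extra "background" label for frames showing none of the three stages, and the function name `frames_to_segments` are all hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical label names. The three stages come from the abstract;
# "background" is an assumption for frames that show none of them.
STATES = ("initial", "transitioning", "end", "background")


def frames_to_segments(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Merge runs of identical per-frame labels into (state, start, end) segments.

    Frame indices are 0-based and the end index is inclusive.
    """
    segments = []
    index = 0
    for state, run in groupby(labels):
        length = len(list(run))
        segments.append((state, index, index + length - 1))
        index += length
    return segments


# Example: an 8-frame clip with one frame of background before the OSC begins.
labels = [
    "background",
    "initial", "initial",
    "transitioning", "transitioning", "transitioning",
    "end", "end",
]
segments = frames_to_segments(labels)
```

For this clip, `segments` groups frames 1 to 2 as the initial state, 3 to 5 as the transitioning state, and 6 to 7 as the end state. In the open-world setting the same state labels would apply whether or not the object was seen during training.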
