Tracking and Understanding Object Transformations

Yihong Sun, Xinyu Yang, Jennifer J. Sun, Bharath Hariharan

TL;DR

This work tackles tracking objects through state transformations by introducing Track Any State and TubeletGraph, a zero-shot framework that partitions a video into tubelets, recovers missing post-transformation objects using spatial and semantic priors, and constructs a state graph describing the transformations with GPT-4-based natural language descriptions. A new benchmark, VOST-TAS, extends VOST with explicit transformation annotations to evaluate both tracking and transformation understanding. The approach achieves state-of-the-art tracking under transformations on multiple datasets and demonstrates robust grounding and semantic reasoning for complex object changes, while providing qualitative evidence of recovering missing object parts and describing the transformation process. The results highlight the value of combining spatiotemporal partitioning, priors for candidate recovery, and vision-language reasoning to enable richer, more interpretable video understanding with potential impact in robotics and scene understanding, while noting computational cost and broader ethical considerations.

Abstract

Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks, and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.
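The recovery step described above, which keeps an overlooked track only if it is both close to the prompt object and semantically consistent with it, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Region` type, the bounding-box proximity score, the cosine-similarity check, and the thresholds `tau_prox` and `tau_sem` are all hypothetical stand-ins chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical container for one tracked region (not from the paper)."""
    name: str
    pixels: set      # set of (row, col) coordinates covered by the mask
    feature: list    # appearance embedding, assumed to be precomputed


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def proximity(candidate, target, margin):
    """Fraction of candidate pixels inside the target's bounding box,
    expanded by `margin` pixels. An assumed proxy for spatial proximity."""
    if not candidate or not target:
        return 0.0
    rows = [r for r, _ in target]
    cols = [c for _, c in target]
    r0, r1 = min(rows) - margin, max(rows) + margin
    c0, c1 = min(cols) - margin, max(cols) + margin
    inside = sum(1 for r, c in candidate if r0 <= r <= r1 and c0 <= c <= c1)
    return inside / len(candidate)


def recover(target, candidates, margin=2, tau_prox=0.5, tau_sem=0.7):
    """Keep only candidates that satisfy BOTH priors (thresholds are assumed)."""
    return [
        c.name
        for c in candidates
        if proximity(c.pixels, target.pixels, margin) >= tau_prox
        and cosine(c.feature, target.feature) >= tau_sem
    ]


# Usage: a 5x5 prompt object and three later-emerging candidate regions.
target = Region("apple", {(r, c) for r in range(10, 15) for c in range(10, 15)}, [1.0, 0.0])
candidates = [
    Region("near_similar", {(15, 12), (16, 12)}, [0.9, 0.1]),   # passes both checks
    Region("far_similar", {(40, 40)}, [1.0, 0.0]),              # fails proximity
    Region("near_different", {(12, 16)}, [0.0, 1.0]),           # fails semantics
]
recovered = recover(target, candidates)
```

Requiring both checks is the point of the sketch: a nearby but unrelated region (for example, the knife next to the apple) and a similar-looking but distant region are both rejected.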

Paper Structure

This paper contains 42 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (top) Given a video and an object mask as a prompt, TubeletGraph tracks the object consistently while building a state graph that records each detected transformation and its resulting effect. (bottom) Compared to existing object trackers (SAM2) or video Q&A systems (GPT-4), TubeletGraph predicts complete object tracks while providing spatiotemporal grounding for the transformation.
  • Figure 2: Overview of the proposed TubeletGraph. (1) Given a video and an initial prompt object mask, we first partition the initial frame via CropFormer (CF) and track every region forward in time via SAM2. For each empty region in a later frame, we initiate a new track if an entity in that frame matches it. This yields a spatiotemporal partition of the video. (2) For each later-emerging entity region, we reason about its proximity to and semantic consistency with the prompt object, and recover only regions that satisfy both. (3) For each recovered region, we prompt multi-modal LLMs to describe the transformation and the resulting objects. (4) As a result, TubeletGraph consistently tracks transforming objects while mapping every transformation and its resulting regions in a state graph representation.
  • Figure 3: Qualitative Results on VOST val. We showcase TubeletGraph's tracking and state graph predictions on top, with comparisons against baselines at a particular ending frame at the bottom.
  • Figure 4: Examples of VOST-TAS.
  • Figure 5: Failure examples of TubeletGraph.
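
The state graph mentioned in the Figure 1 and Figure 2 captions can be pictured as a small data structure that links the prompt object to the regions recovered after each transformation. The sketch below is only one possible representation, not the authors' format: the class names, fields, and the example entries (frame index, action, region names) are hypothetical, and the description string stands in for text that a multi-modal LLM would supply.

```python
from dataclasses import dataclass, field


@dataclass
class Transformation:
    """One detected state change (all field names are assumptions)."""
    frame: int       # frame index where the change is grounded
    action: str      # natural-language description of the change
    source: str      # region that was transformed
    results: list    # regions that exist after the change


@dataclass
class StateGraph:
    prompt_object: str
    transformations: list = field(default_factory=list)

    def add(self, frame, action, source, results):
        """Record a transformation and the regions it produced."""
        self.transformations.append(
            Transformation(frame, action, source, list(results))
        )

    def tracked_regions(self):
        """Prompt object plus every recovered region, in temporal order."""
        regions = [self.prompt_object]
        for t in sorted(self.transformations, key=lambda t: t.frame):
            for r in t.results:
                if r not in regions:
                    regions.append(r)
        return regions

    def timeline(self):
        """Human-readable summary, one line per transformation."""
        return [
            f"frame {t.frame}: {t.source} --{t.action}--> {', '.join(t.results)}"
            for t in sorted(self.transformations, key=lambda t: t.frame)
        ]


# Usage with made-up values for illustration only.
graph = StateGraph("apple")
graph.add(42, "cut", "apple", ["apple half A", "apple half B"])
regions = graph.tracked_regions()
lines = graph.timeline()
```

Storing the frame index alongside each description keeps the temporal grounding and the semantic description together, which is what the captions describe the state graph as providing.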