Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

Thong Thanh Nguyen, Xiaobao Wu, Yi Bin, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

TL;DR

The paper tackles temporal panoptic scene graph generation by addressing the limitations of motion-insensitive pooling over mask tubes. It introduces a motion-aware contrastive learning framework that pairs mask-tube representations based on shared motion patterns while contrasting them against temporally shuffled or hard same-video triplets, with optimal transport distance used to quantify tube similarity. The approach yields significant improvements over state-of-the-art methods on OpenPVSG and PSG4D datasets across natural and 4D inputs, especially for dynamic relations, and is supported by thorough ablations and qualitative analyses. This work demonstrates the practical impact of motion-centric representation learning for complex, temporally grounded scene understanding in vision systems.

Abstract

To equip artificial intelligence with a comprehensive understanding of the temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes that represent entities and edges that capture temporal relations. Existing methods encode entity masks tracked across the temporal dimension (mask tubes), then predict their relations with a temporal pooling operation, which does not fully exploit the motion that is indicative of the entities' relation. To overcome this limitation, we introduce a contrastive representation learning framework that focuses on motion patterns for temporal scene graph generation. First, our framework encourages the model to learn close representations for mask tubes of similar subject-relation-object triplets. Second, we push mask tubes apart from their temporally shuffled versions. Third, we learn distant representations for mask tubes that belong to the same video but to different triplets. Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets.
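
The three objectives above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the frame-difference motion descriptors, the cosine cost, the Sinkhorn approximation of the optimal transport distance, the InfoNCE-style loss form, and all names and parameters (`motion_descriptors`, `ot_distance`, `contrastive_loss`, `eps`, `tau`) are assumptions chosen for illustration. A tube is represented here as a plain list of per-frame feature vectors.

```python
import math

def motion_descriptors(tube):
    """Frame-to-frame differences of a tube (assumed motion representation)."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(tube, tube[1:])]

def cosine_cost(u, v):
    """1 - cosine similarity; a zero (no-motion) vector is treated as orthogonal."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(u, v)) / (nu * nv)

def ot_distance(tube_a, tube_b, eps=0.1, n_iters=100):
    """Entropy-regularised OT cost (Sinkhorn) between the motion descriptors
    of two tubes, with uniform weights over descriptors."""
    a, b = motion_descriptors(tube_a), motion_descriptors(tube_b)
    n, m = len(a), len(b)
    cost = [[cosine_cost(u, v) for v in b] for u in a]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u_scale, v_scale = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u_scale = [(1.0 / n) / sum(kernel[i][j] * v_scale[j] for j in range(m))
                   for i in range(n)]
        v_scale = [(1.0 / m) / sum(kernel[i][j] * u_scale[i] for i in range(n))
                   for j in range(m)]
    # Transport plan entry (i, j) is u_scale[i] * kernel[i][j] * v_scale[j].
    return sum(u_scale[i] * kernel[i][j] * v_scale[j] * cost[i][j]
               for i in range(n) for j in range(m))

def contrastive_loss(anchor, positive, negatives, tau=0.5):
    """InfoNCE-style loss: small when the positive is closer than the negatives."""
    pos = math.exp(-ot_distance(anchor, positive) / tau)
    neg = sum(math.exp(-ot_distance(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy tubes: 5 frames of 2-D features (hypothetical values).
anchor = [[float(t), 0.0] for t in range(5)]           # steady rightward motion
positive = [[t + 3.0, 1.0] for t in range(5)]          # same motion, other location
shuffled = [anchor[i] for i in (2, 0, 4, 1, 3)]        # temporally shuffled anchor
loss = contrastive_loss(anchor, positive, [shuffled])
```

In this toy setting the positive tube shares the anchor's motion pattern despite different feature values, while the shuffled tube contains the same frames in a different order and therefore a different set of motion descriptors. Hard negatives drawn from other triplets of the same video could be appended to the `negatives` list in the same way. Frame differences are used because an optimal transport cost over unordered raw frames would not change under shuffling.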

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: The state-of-the-art IPS+T - Convolution baseline [yang2023panoptic] achieves higher R@50 scores on static relations (e.g., on, sitting on, and standing on) than on dynamic relations (e.g., kicking, running on, and opening). In contrast, our method performs well on both static and dynamic relations.
  • Figure 2: Examples of temporal panoptic scene graph generation by state-of-the-art methods [yang2023panoptic, yang20244d] and by our method.
  • Figure 3: Framework overview of contrastive learning for temporal scene graph generation.
  • Figure 4: Proposed strategy to select strong-motion tubes.
  • Figure 5: Ablation results on threshold $\gamma$.