Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation
Thong Thanh Nguyen, Xiaobao Wu, Yi Bin, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
TL;DR
The paper tackles temporal panoptic scene graph generation by addressing a limitation of existing methods: pooling over mask tubes is insensitive to motion. It introduces a motion-aware contrastive learning framework that pulls together mask-tube representations sharing similar motion patterns and pushes them away from two kinds of negatives: temporally shuffled copies of the same tube, and hard negatives drawn from different triplets in the same video. Optimal transport distance is used to quantify the similarity between tubes. The approach improves on state-of-the-art methods on the OpenPVSG and PSG4D datasets, covering both natural video and 4D inputs, with the largest gains on dynamic relations. The results are supported by ablations and qualitative analyses, and they indicate that motion-centric representation learning benefits temporally grounded scene understanding.
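As a rough illustration of how an optimal transport distance between two mask tubes could be computed, the following minimal sketch uses entropy-regularised Sinkhorn iterations over per-frame features. The function name `sinkhorn_distance`, the cosine cost, the uniform frame weights, and the values of `eps` and `n_iters` are all assumptions made for this example; the summary above does not specify how the paper constructs its transport problem.

```python
import numpy as np

def sinkhorn_distance(tube_a, tube_b, eps=0.1, n_iters=200):
    """Entropy-regularised OT cost between two tubes of frame features.

    tube_a: array of shape (Ta, D), tube_b: array of shape (Tb, D).
    Assumptions (not from the paper): cosine cost, uniform frame weights.
    """
    # L2-normalise frame features so the dot product is cosine similarity.
    a = tube_a / (np.linalg.norm(tube_a, axis=1, keepdims=True) + 1e-8)
    b = tube_b / (np.linalg.norm(tube_b, axis=1, keepdims=True) + 1e-8)
    cost = np.clip(1.0 - a @ b.T, 0.0, None)  # (Ta, Tb) cosine distance

    # Uniform mass on every frame of each tube.
    mu = np.full(len(a), 1.0 / len(a))
    nu = np.full(len(b), 1.0 / len(b))

    # Sinkhorn scaling iterations on the Gibbs kernel.
    kernel = np.exp(-cost / eps)
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (kernel.T @ u)
        u = mu / (kernel @ v)

    # Transport plan and the resulting transport cost.
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())

rng = np.random.default_rng(0)
tube_x = rng.normal(size=(8, 16))
tube_y = rng.normal(size=(8, 16))
print(sinkhorn_distance(tube_x, tube_x), sinkhorn_distance(tube_x, tube_y))
```

With uniform frame weights, this particular cost treats a tube as an unordered set of frames, so it does not change when the frames are reordered. In this sketch, any sensitivity to temporal order would have to come from the frame features themselves.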
Abstract
To equip artificial intelligence with a comprehensive understanding of the temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes that represent entities and edges that capture temporal relations. Existing methods encode entity masks tracked across the temporal dimension (mask tubes), then predict their relations with a temporal pooling operation, which does not fully exploit the motion that is indicative of the entities' relation. To overcome this limitation, we introduce a contrastive representation learning framework that focuses on motion patterns for temporal scene graph generation. First, our framework encourages the model to learn close representations for mask tubes of similar subject-relation-object triplets. Second, we push mask tubes apart from their temporally shuffled versions. Third, we learn distant representations for mask tubes that belong to the same video but to different triplets. Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets.
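The three objectives described in the abstract can be sketched as a single InfoNCE-style loss with one positive and two kinds of negatives. This is a minimal illustration under stated assumptions, not the authors' implementation: `encode` is a hypothetical stand-in for a learned tube encoder, and `shuffle_tube`, `info_nce`, and the temperature value are names and choices introduced only for this example.

```python
import numpy as np

def shuffle_tube(tube, rng):
    """Return the tube (T, D) with its frames in a different random order."""
    identity = np.arange(len(tube))
    perm = rng.permutation(len(tube))
    while len(tube) > 1 and np.array_equal(perm, identity):
        perm = rng.permutation(len(tube))
    return tube[perm]

def encode(tube):
    """Hypothetical order-sensitive encoder (stand-in for a learned model).

    Concatenates mean pooling (order-insensitive) with a position-weighted
    average (order-sensitive), so shuffling the frames changes the output.
    """
    weights = np.linspace(-1.0, 1.0, len(tube))[:, None]
    return np.concatenate([tube.mean(axis=0), (weights * tube).mean(axis=0)])

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss with cosine similarity.

    anchor, positive: shape (D,); negatives: shape (K, D).
    """
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    a, p, n = unit(anchor), unit(positive), unit(negatives)
    # Index 0 holds the positive logit, the rest are negative logits.
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits = logits - logits.max()  # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

rng = np.random.default_rng(0)
anchor_tube = rng.normal(size=(8, 16))                           # anchor mask tube
positive_tube = anchor_tube + 0.01 * rng.normal(size=(8, 16))    # similar triplet
shuffled_tube = shuffle_tube(anchor_tube, rng)                   # temporal negative
other_triplet_tube = rng.normal(size=(8, 16))                    # same-video negative

negatives = np.stack([encode(shuffled_tube), encode(other_triplet_tube)])
loss = info_nce(encode(anchor_tube), encode(positive_tube), negatives)
print(loss)
```

The encoder here is deliberately order-sensitive: with plain mean pooling, a tube and its shuffled copy would map to the same vector, and the shuffled negative would carry no training signal.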
