Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
Shengqiong Wu, Hao Fei, Jingkang Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, Tat-seng Chua
TL;DR
This work tackles the data scarcity and open-vocabulary challenges in 4D panoptic scene graph (4D-PSG) generation by introducing a 4D Large Language Model (4D-LLM) backbone paired with a 3D mask decoder for end-to-end 4D-PSG output. A chained inference mechanism leverages the LLM's open-vocabulary capabilities to iteratively refine object and relation labels, while a 2D-to-4D visual scene transfer (D2→4-VST) framework transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, mitigating data scarcity. The framework progresses through four learning steps, culminating in large-scale visual scene transfer from 2D SG datasets to train the 4D-LLM and end-to-end fine-tuning on 4D data. Experimental results on PSG4D GTA and HOI datasets show substantial gains over baselines, with strong open-vocabulary and zero-shot performance, demonstrating the method’s effectiveness for scalable, real-world 4D scene understanding with potential robotic and simulation applications.
Abstract
The recently introduced 4D Panoptic Scene Graph (4D-PSG) provides an advanced representation for comprehensively modeling the dynamic 4D visual world. However, pioneering 4D-PSG research suffers from severe data scarcity and from the out-of-vocabulary problems that result from it; in addition, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit the open-vocabulary capabilities of LLMs to iteratively infer accurate and comprehensive object and relation labels. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, in which a spatial-temporal scene transcending strategy transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that our method outperforms baseline models by a large margin, highlighting its effectiveness.
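To make the target output more concrete, the following is a minimal sketch of how a 4D panoptic scene graph could be held in memory, assuming that each object is tracked over a set of frames and each relation holds over an inclusive frame span. The class names (`ObjectNode`, `Relation`, `SceneGraph4D`), their fields, and the `triplets_at` helper are hypothetical and chosen only for illustration; they are not the paper's data format, and the per-frame 3D masks that a mask decoder would produce are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectNode:
    """An object tracked through the scene (hypothetical structure)."""
    node_id: int
    label: str                # open-vocabulary object label, e.g. "person"
    frames: tuple[int, ...]   # frames in which the object is visible


@dataclass(frozen=True)
class Relation:
    """A subject-predicate-object edge that holds over a frame span."""
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int            # inclusive


@dataclass
class SceneGraph4D:
    nodes: dict[int, ObjectNode] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def add_node(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def add_relation(self, rel: Relation) -> None:
        # Reject edges that point at unknown objects or have an empty span.
        if rel.subject_id not in self.nodes or rel.object_id not in self.nodes:
            raise KeyError("relation refers to an unknown node")
        if rel.start_frame > rel.end_frame:
            raise ValueError("start_frame must not exceed end_frame")
        self.relations.append(rel)

    def triplets_at(self, frame: int) -> list[tuple[str, str, str]]:
        """Return (subject, predicate, object) labels active at a frame."""
        return [
            (self.nodes[r.subject_id].label, r.predicate, self.nodes[r.object_id].label)
            for r in self.relations
            if r.start_frame <= frame <= r.end_frame
        ]
```

Querying `triplets_at(frame)` on such a graph returns the relations active at a given time step, which is the temporal aspect that a static 2D scene graph lacks.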
