Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Shengqiong Wu, Hao Fei, Jingkang Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, Tat-seng Chua

TL;DR

This work tackles the data scarcity and open-vocabulary challenges in 4D panoptic scene graph (4D-PSG) generation by introducing a 4D Large Language Model (4D-LLM) backbone paired with a 3D mask decoder for end-to-end 4D-PSG output. A chained inference mechanism leverages the LLM's open-vocabulary capabilities to iteratively refine object and relation labels, while a 2D-to-4D visual scene transfer (D2→4-VST) framework transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, mitigating data scarcity. The framework progresses through four learning steps, culminating in large-scale visual scene transfer from 2D SG datasets to train the 4D-LLM and end-to-end fine-tuning on 4D data. Experimental results on PSG4D GTA and HOI datasets show substantial gains over baselines, with strong open-vocabulary and zero-shot performance, demonstrating the method’s effectiveness for scalable, real-world 4D scene understanding with potential robotic and simulation applications.

Abstract

The recently introduced 4D Panoptic Scene Graph (4D-PSG) provides the most advanced representation to date for comprehensively modeling the dynamic 4D visual world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity and the resulting out-of-vocabulary problems; in addition, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to iteratively infer accurate and comprehensive object and relation labels. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, in which a spatial-temporal scene transcending strategy transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that our method outperforms baseline models by a large margin, highlighting its effectiveness.
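
To make the target output more concrete, the following is a minimal sketch of how a 4D scene graph could be held in memory: objects as labeled nodes and relations as labeled edges that are valid over a span of frames. All class, field, and method names here (`SceneObject`, `Relation`, `SceneGraph4D`, `triplets_at`) are illustrative assumptions rather than the authors' data format, and the per-object segmentation masks produced by the mask decoder are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SceneObject:
    """A node: one tracked object (hypothetical fields, masks omitted)."""
    object_id: int
    label: str  # free-form category name, e.g. "person"


@dataclass(frozen=True)
class Relation:
    """An edge: subject -predicate-> object over an inclusive frame span."""
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int


@dataclass
class SceneGraph4D:
    """Illustrative container for objects and time-bounded relations."""
    objects: Dict[int, SceneObject] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)

    def add_object(self, object_id: int, label: str) -> None:
        self.objects[object_id] = SceneObject(object_id, label)

    def add_relation(self, subject_id: int, predicate: str, object_id: int,
                     start_frame: int, end_frame: int) -> None:
        # Both endpoints must already exist as nodes.
        if subject_id not in self.objects or object_id not in self.objects:
            raise KeyError("add both objects before relating them")
        if start_frame > end_frame:
            raise ValueError("start_frame must not exceed end_frame")
        self.relations.append(
            Relation(subject_id, predicate, object_id, start_frame, end_frame)
        )

    def triplets_at(self, frame: int) -> List[Tuple[str, str, str]]:
        """Return (subject label, predicate, object label) active at a frame."""
        return [
            (self.objects[r.subject_id].label, r.predicate,
             self.objects[r.object_id].label)
            for r in self.relations
            if r.start_frame <= frame <= r.end_frame
        ]
```

In this sketch, adding a "person" and a "cup" and then a "holding" relation over frames 2 to 5 means that `triplets_at(3)` returns that single triplet while `triplets_at(6)` returns an empty list. Storing the time span on the edge rather than on the node is one possible design choice; it keeps object identity stable while relations come and go.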

Paper Structure

This paper contains 51 sections, 9 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: (a) Illustration of 4D-PSG, (b) SG dataset statistics, and (c) motivation for 2D scene transfer learning.
  • Figure 2: Overview of 2D-to-4D visual scene transfer learning mechanisms for 4D-PSG generation, including 4 key steps.
  • Figure 3: Illustrations of 2D-to-4D visual scene transfer learning.
  • Figure 4: The feature similarity distribution between predicted and gold features without (a) and with (b) step 2.
  • Figure 5: An instance for comparing 4D-LLM with/without chained SG inference mechanism.
  • ...and 6 more figures