4D Panoptic Scene Graph Generation

Jingkang Yang, Jun Cen, Wenxuan Peng, Shuai Liu, Fangzhou Hong, Xiangtai Li, Kaiyang Zhou, Qifeng Chen, Ziwei Liu

TL;DR

This work defines the 4D Panoptic Scene Graph (PSG-4D) to jointly model 4D perception and dynamic relationships, introducing a dataset with 3K RGB-D videos and 1M frames annotated for 4D panoptic segmentation and dynamic scene graphs. It presents PSG4DFormer, a two-stage Transformer-based architecture that first produces 4D panoptic segmentation (with RGB-D and point-cloud variants and a tracking stage) and then learns spatial-temporal relations to output a 4D scene graph. The authors provide extensive experiments demonstrating the method as a strong baseline for PSG-4D, analyze the importance of depth and temporal attention, and showcase a real-world robot application that integrates a large language model for planning and execution. This work advances dynamic, embodied AI by enabling precise 4D grounding and relational reasoning, with practical impact for robotics and interactive systems.

Abstract

We are living in a three-dimensional space while moving forward through a fourth dimension: time. To allow artificial intelligence to develop a comprehensive understanding of such a 4D environment, we introduce 4D Panoptic Scene Graph (PSG-4D), a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding. Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations. To facilitate research in this new area, we build a richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of 1M frames, each of which is labeled with 4D panoptic segmentation masks as well as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer, a Transformer-based model that can predict panoptic segmentation masks, track masks along the time axis, and generate the corresponding scene graphs via a relation component. Extensive experiments on the new dataset show that our method can serve as a strong baseline for future research on PSG-4D. In the end, we provide a real-world application example to demonstrate how we can achieve dynamic scene understanding by integrating a large language model into our PSG-4D system.
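
To make the node-and-edge description above concrete, here is a minimal sketch of how such a graph could be held in memory. It is an illustration under stated assumptions, not the authors' data format: the class names (`EntityNode`, `RelationEdge`, `SceneGraph4D`), the field names, and the example categories and predicates are all hypothetical, and masks are left as opaque per-frame objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EntityNode:
    """One tracked entity (node). Field names are hypothetical."""
    entity_id: int
    category: str
    # frame index -> panoptic mask for that frame (kept opaque in this sketch)
    masks: Dict[int, object] = field(default_factory=dict)


@dataclass(frozen=True)
class RelationEdge:
    """One temporally stamped relation (edge) between two entities."""
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int  # inclusive
    end_frame: int    # inclusive


@dataclass
class SceneGraph4D:
    nodes: Dict[int, EntityNode] = field(default_factory=dict)
    edges: List[RelationEdge] = field(default_factory=list)

    def relations_at(self, frame: int) -> List[Tuple[str, str, str]]:
        """Return (subject, predicate, object) category triplets active at `frame`."""
        return [
            (self.nodes[e.subject_id].category,
             e.predicate,
             self.nodes[e.object_id].category)
            for e in self.edges
            if e.start_frame <= frame <= e.end_frame
        ]


# Illustrative content only; labels are not taken from the dataset vocabulary.
graph = SceneGraph4D()
graph.nodes[0] = EntityNode(0, "person")
graph.nodes[1] = EntityNode(1, "bottle")
graph.nodes[2] = EntityNode(2, "ground")
graph.edges.append(RelationEdge(0, "holding", 1, start_frame=0, end_frame=40))
graph.edges.append(RelationEdge(1, "on", 2, start_frame=41, end_frame=90))

print(graph.relations_at(20))
print(graph.relations_at(60))
```

Storing a frame interval on each edge is one simple way to express that a relation holds only for part of the video; querying a single frame then reduces to an interval check.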

Paper Structure

This paper contains 27 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Conceptual illustration of PSG-4D. PSG-4D is essentially a spatiotemporal representation capturing not only fine-grained semantics in image pixels (i.e., panoptic segmentation masks) but also the temporal relational information (i.e., scene graphs). In (a) and (b), the model abstracts information streaming in RGB-D videos into (i) nodes that represent entities with accurate location and status information and (ii) edges that encapsulate the temporal relations. Such a rich 4D representation serves as a bridge between the PSG-4D system and a large language model, which greatly facilitates the decision-making process, as illustrated in (c).
  • Figure 2: Examples and word clouds of the PSG-4D dataset. The PSG-4D dataset contains two subsets: (a) PSG4D-GTA, selected from the SAIL-VOS 3D dataset, and (b) PSG4D-HOI, from the HOI4D dataset. Four frames of an example video are shown for each subset. Each frame has aligned RGB and depth with panoptic segmentation annotation. The scene graph is annotated in the form of triplets. Word clouds of the object and relation categories in each subset are also shown.
  • Figure 3: Illustration of the PSG4DFormer pipeline. This unified pipeline supports both RGB-D and point cloud video inputs and is composed of two main components: 4D panoptic segmentation modeling and relation modeling. The first stage seeks to obtain the 4D panoptic segmentation mask for each object, along with its corresponding feature tube spanning the video length. This is accomplished with the aid of (a) frame-level panoptic segmentation and (b) a tracking model. The subsequent stage (c) employs a spatial-temporal transformer to predict pairwise relations based on all feature tubes derived from the first stage.
  • Figure 4: Demonstration of a Robot Deployed with the PSG-4D Model. The service robot interprets the RGB-D sequence shown in (a), where a man is seen drinking coffee and subsequently dropping the empty bottle on the ground. The robot processes this sequence, translating it into a 4D scene graph depicted in (b). This graph comprises a set of temporally stamped triplets, with each object associated with a panoptic mask that accurately grounds it in 3D space. The robot regularly sends its updated PSG-4D to GPT-4 and awaits feedback and instructions. In this scenario, GPT-4 advises the robot to clean up the discarded bottle and remind the man about his action. This directive is translated into robot action, as visualized in (d).
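
The Figure 3 caption describes linking frame-level panoptic masks into per-object tubes that span the video. The sketch below illustrates that idea with greedy IoU matching between consecutive observations. This is a stand-in chosen for brevity, not the tracking model used in the paper; the function names (`mask_iou`, `link_masks`), the pixel-set mask encoding, and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]  # a mask as a set of (row, col) pixels (assumed encoding)


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel-set masks."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def link_masks(frames: List[List[Mask]],
               iou_threshold: float = 0.5) -> Dict[int, Dict[int, Mask]]:
    """Greedily link per-frame masks into tubes: tube_id -> {frame_index: mask}."""
    tubes: Dict[int, Dict[int, Mask]] = {}
    last_seen: Dict[int, Mask] = {}  # most recent mask of each tube
    for t, masks in enumerate(frames):
        # Score every (existing tube, current mask) pair, best overlap first.
        candidates = sorted(
            ((mask_iou(prev, m), tube_id, j)
             for tube_id, prev in last_seen.items()
             for j, m in enumerate(masks)),
            reverse=True,
        )
        used_tubes, used_masks = set(), set()
        for iou, tube_id, j in candidates:
            if iou < iou_threshold:
                break  # remaining pairs overlap too little to be the same object
            if tube_id in used_tubes or j in used_masks:
                continue
            tubes[tube_id][t] = masks[j]
            last_seen[tube_id] = masks[j]
            used_tubes.add(tube_id)
            used_masks.add(j)
        # Any mask left unmatched starts a new tube.
        for j, m in enumerate(masks):
            if j not in used_masks:
                new_id = len(tubes)
                tubes[new_id] = {t: m}
                last_seen[new_id] = m
    return tubes


def box(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Rectangular toy mask covering rows r0..r1-1 and columns c0..c1-1."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


frames = [
    [box(0, 0, 4, 4), box(10, 10, 14, 14)],
    [box(10, 11, 14, 15), box(0, 1, 4, 5)],  # same two objects, shifted and reordered
    [box(0, 2, 4, 6)],                        # second object no longer visible
]
tubes = link_masks(frames)
for tube_id, tube in tubes.items():
    print(tube_id, sorted(tube))
```

Each resulting tube plays the role of one node's mask sequence; a relation model can then operate on pairs of tubes. Greedy matching is easy to follow but can mis-assign objects that overlap heavily, which is one reason a learned tracker is preferable in practice.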