Panoptic Scene Graph Generation

Jingkang Yang, Yi Zhe Ang, Zujin Guo, Kaiyang Zhou, Wayne Zhang, Ziwei Liu

TL;DR

This work introduces Panoptic Scene Graph Generation (PSG), a paradigm that grounds scene graphs on panoptic segmentations rather than bounding boxes to capture both objects and background context. It builds a large PSG dataset (COCO+VG overlap) with 133 object classes and 56 predicates, and provides a comprehensive benchmark including four two-stage baselines and two one-stage baselines (PSGTR and PSGFormer) based on DETR. Key findings show one-stage models can be highly competitive and unbiased (PSGFormer), while end-to-end triplet-based approaches (PSGTR) reach state-of-the-art results given longer training; two-stage methods benefit from high-quality segmentation, illustrating the interplay between segmentation and relation reasoning. The work outlines open challenges and provides a foundation for future research in richer scene understanding and downstream tasks like visual reasoning and segmentation-guided image generation.

Abstract

Existing research addresses scene graph generation (SGG) -- a critical technology for scene understanding in images -- from a detection perspective, i.e., objects are detected using bounding boxes followed by prediction of their pairwise relationships. We argue that such a paradigm causes several problems that impede the progress of the field. For instance, bounding box-based labels in current datasets usually contain redundant classes like hairs, and leave out background information that is crucial to the understanding of context. In this work, we introduce panoptic scene graph generation (PSG), a new problem task that requires the model to generate a more comprehensive scene graph representation based on panoptic segmentations rather than rigid bounding boxes. A high-quality PSG dataset, which contains 49k well-annotated overlapping images from COCO and Visual Genome, is created for the community to keep track of its progress. For benchmarking, we build four two-stage baselines, which are modified from classic methods in SGG, and two one-stage baselines called PSGTR and PSGFormer, which are based on the efficient Transformer-based detector, i.e., DETR. While PSGTR uses a set of queries to directly learn triplets, PSGFormer separately models the objects and relations in the form of queries from two Transformer decoders, followed by a prompting-like relation-object matching mechanism. In the end, we share insights on open challenges and future directions.
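To make the representation described in the abstract concrete, here is a minimal sketch of a scene graph grounded on a panoptic map instead of boxes. The class names `Segment` and `Relation`, the helper functions, and the toy labels and predicates are illustrative assumptions, not the dataset's actual schema or the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    label: str       # category name (illustrative, not the official class list)
    is_thing: bool   # True for countable objects, False for background "stuff"


@dataclass(frozen=True)
class Relation:
    subject: int     # index into the segment list
    predicate: str   # illustrative predicate string
    obj: int         # index into the segment list


def pixel_area(seg_map: List[List[int]], segment_id: int) -> int:
    """Count the pixels that the panoptic map assigns to one segment."""
    return sum(row.count(segment_id) for row in seg_map)


def describe(segments: List[Segment], relations: List[Relation]) -> List[str]:
    """Render each relation as a readable 'subject predicate object' string."""
    return [
        f"{segments[r.subject].label} {r.predicate} {segments[r.obj].label}"
        for r in relations
    ]


# Toy 3x4 panoptic map: every pixel belongs to exactly one segment id,
# so background regions are grounded just like foreground objects.
seg_map = [
    [2, 2, 2, 2],
    [2, 0, 0, 2],
    [1, 1, 1, 1],
]
segments = [
    Segment("person", is_thing=True),
    Segment("pavement", is_thing=False),
    Segment("tree", is_thing=False),
]
relations = [Relation(0, "standing on", 1), Relation(0, "beside", 2)]
```

Because each pixel carries exactly one segment id, the segment areas sum to the image size, and relations can refer to stuff regions such as the pavement as easily as to things.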

Paper Structure (43 sections, 8 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Scene graph generation (a. SGG task) vs. panoptic scene graph generation (b. PSG task). The existing SGG task in (a) uses bounding box-based labels, which are often inaccurate (pixels covered by a bounding box do not necessarily belong to the annotated class) and cannot fully capture the background information. In contrast, the proposed PSG task in (b) presents a more comprehensive and clean scene graph representation, with more accurate localization of objects and including relationships with the background (known as stuff), i.e., the trees and pavement.
  • Figure 1: Comparison between classic SGG datasets and the PSG dataset. #PPI counts predicates per image. DupFree checks whether duplicated object groundings are cleaned up. Spvn indicates whether the objects are grounded by bounding boxes or segmentations.
  • Figure 2: Word Cloud for PSG Predicates.
  • Figure 3: Two-stage PSG baselines using Panoptic FPN. a) In stage one, for each thing/stuff object, Panoptic FPN (Kirillov et al., 2019) produces a segmentation mask, whose tightest bounding box is used to crop out the object feature. The union of the relevant objects is used to produce relation features. b) In stage two, the extracted object and relation features are fed into any existing SGG relation model to predict the relation triplets.
  • Figure 4: PSGTR: One-stage PSG baseline. The one-stage model takes in a) features extracted by CNNs with positional encoding, and a set of queries that represent triplets. b) The query learning block processes the image features with a Transformer encoder-decoder and uses the queries to represent triplet information. Then, c) the PSG prediction head turns each query into a concrete triplet prediction by producing subject/object/predicate classes with simple FFNs, and uses panoptic heads for panoptic segmentation.
  • ...and 8 more figures
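
The stage-one step in the Figure 3 caption (a mask reduced to its tightest bounding box, and a union region for a pair of objects) can be sketched as follows. This is a pure-Python illustration under stated assumptions: the function names, the inclusive `(x_min, y_min, x_max, y_max)` box convention, and taking the union over boxes are choices made here for clarity and are not taken from the authors' implementation.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates (assumed convention)
Box = Tuple[int, int, int, int]


def tightest_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the smallest box enclosing all nonzero pixels, or None if the mask is empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def union_box(a: Box, b: Box) -> Box:
    """Return the smallest box that encloses both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


# Toy binary masks for one "thing" segment and one "stuff" segment.
person_mask = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
pavement_mask = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

person_box = tightest_box(person_mask)      # region used to crop the object feature
pavement_box = tightest_box(pavement_mask)
pair_box = union_box(person_box, pavement_box)  # region used for the relation feature
```

In a real pipeline the boxes would index into a feature map (for example through RoI pooling) instead of raw pixels; the sketch only shows how the crop regions follow from the masks.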