TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan

TL;DR

This work introduces Caption-to-PSG, a challenging problem of learning panoptic scene graphs purely from textual descriptions, and proposes TextPSG, a four-module framework (region grouper, entity grounder, segment merger, label generator) that jointly learns segmentation, grounding, and open-vocabulary labeling from image-caption data. The entity grounder uses fine-grained contrastive learning to align image regions with caption entities, while the segment merger uses ground-truth-like pseudo labels to learn segment similarity; the label generator employs an auto-regressive, language-model-based decoder with a prompt-embedding technique (PET) to predict object semantics and relation predicates. TextPSG achieves strong improvements and robust out-of-distribution performance on Caption-to-PSG benchmarks, and the grounder and merger components also benefit text-supervised semantic segmentation. By leveraging large-scale caption data and a language-model-based labeling approach, this method broadens PSG applicability to open-set concepts and relations, reducing annotation burden and enabling richer scene understanding in real-world settings.
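As a rough illustration of region-entity alignment, the sketch below grounds each caption entity onto the image segment with the highest cosine similarity. The toy feature vectors, the function names `cosine` and `ground_entities`, and the hard argmax assignment are assumptions made for illustration only; the summary above does not specify the contrastive objective or the assignment rule that TextPSG actually uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def ground_entities(segment_feats, entity_feats):
    """Map each entity index to the index of its most similar segment."""
    grounding = {}
    for e_idx, e in enumerate(entity_feats):
        sims = [cosine(s, e) for s in segment_feats]
        grounding[e_idx] = max(range(len(sims)), key=sims.__getitem__)
    return grounding

# Toy features (hypothetical): three segments, two caption entities.
segments = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
entities = [[0.9, 0.1], [0.1, 0.9]]
grounding = ground_entities(segments, entities)
```

In a contrastive training setup, similarities like these would typically be pushed up for matched image-caption pairs and down for mismatched ones; the sketch covers only the alignment step.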

Abstract

Panoptic Scene Graph has recently been proposed for comprehensive scene understanding. However, previous works adopt fully supervised learning, which requires large amounts of dense pixel-wise annotation that is tedious and expensive to obtain. To address this limitation, we study a new problem: Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG). The key idea is to generate panoptic scene graphs using only the large collection of free image-caption data on the Web. The problem is very challenging because of three constraints: 1) no location priors; 2) no explicit links between visual regions and textual entities; and 3) no pre-defined concept sets. To tackle this problem, we propose a new framework, TextPSG, consisting of four modules: a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques. The region grouper first groups image pixels into different segments, and the entity grounder then aligns visual segments with language entities based on the textual description of the segment being referred to. The grounding results thus serve as pseudo labels that enable the segment merger to learn segment similarity and guide the label generator to learn object semantics and relation predicates, resulting in a fine-grained structured scene understanding. Our framework is effective, significantly outperforming the baselines and achieving strong out-of-distribution robustness. We perform comprehensive ablation studies to corroborate the effectiveness of our design choices and provide an in-depth analysis to highlight future directions. Our code, data, and results are available on our project page: https://textpsg.github.io/.
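The way grounding results can act as pseudo labels for the segment merger can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: segments grounded to the same entity are treated as positive pairs, and at inference segments whose predicted similarity exceeds a threshold are merged transitively with a union-find. The names `pairwise_targets` and `merge_segments`, the toy inputs, and the 0.5 threshold are all hypothetical.

```python
def pairwise_targets(segment_to_entity):
    """Build a 0/1 target matrix: 1 where two segments share a grounded entity.

    Segments without a grounded entity (None) only match themselves.
    """
    n = len(segment_to_entity)
    targets = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same = (segment_to_entity[i] is not None
                    and segment_to_entity[i] == segment_to_entity[j])
            targets[i][j] = 1 if (i == j or same) else 0
    return targets

def merge_segments(similarity, threshold=0.5):
    """Group segment indices whose pairwise similarity exceeds the threshold."""
    n = len(similarity)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical grounding: segments 0 and 2 both refer to the entity "sheep".
targets = pairwise_targets(["sheep", "grass", "sheep", None])
# Hypothetical predicted similarities between three segments at inference.
merged = merge_segments([[1.0, 0.1, 0.8], [0.1, 1.0, 0.2], [0.8, 0.2, 1.0]])
```

Transitive merging is one possible design choice: if segments A and B are similar and B and C are similar, all three end up in one group even when A and C fall below the threshold.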

Paper Structure (32 sections, 22 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Problem Overview. Different from the traditional bbox-based form of the scene graph as shown in (a), Caption-to-PSG aims to generate the mask-based panoptic scene graph. In Caption-to-PSG, the model has no access to any location priors, explicit region-entity links, or pre-defined concept sets. Consequently, the model is required to learn partitioning and grounding as illustrated in (b), as well as object semantics and relation predicates as illustrated in (c), all purely from textual descriptions.
  • Figure 2: Framework Overview of TextPSG. The framework consists of four modules cooperating with each other: a region grouper to merge regions in the input image into several segments, an entity grounder to ground entities in the caption onto the image segments, a segment merger to learn similarity matrices to merge small image segments during inference, and a label generator to learn the prediction of object semantics and relation predicates. The solid arrows indicate the training flow, while the dashed arrows indicate the inference flow. The arrows from the region grouper to the label generator, which indicate the segment feature and mask query, are omitted.
  • Figure 3: Qualitative Comparison between SGGNLS-o (a) and Ours (b). For each method, the results of object location are shown on the left, while the results of scene graph generation are shown on the right. For Ours, scene graphs predicted within the given concept sets are provided in the middle column, and scene graphs directly predicted through the auto-regressive generation (i.e., an open-vocabulary manner) in the label generator are additionally provided in the right column.
  • Figure 4: Region-Entity Alignment Results for Captions of Different Granularity. Two captions of different granularity are used to perform region-entity alignment on the same image: (a) one describes the two sheep individually, while (b) the other merges them in plural form.
  • Figure 5: More Qualitative Comparison between SGGNLS-o (a) and Ours (b). For each method, the results of object location are shown on the left, while the results of scene graph generation are shown on the right. For both SGGNLS-o and Ours, the visualized relations are picked from the top 10 triplets in the scene graph, keeping only those with a predicate score greater than 0.6. For SGGNLS-o, only proposals matched with the ground truth are visualized; a match requires only a correct location and ignores the semantics.
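The triplet selection described for Figure 5 can be sketched as below. The caption does not say which score ranks the top 10 triplets, nor whether the 0.6 filter is applied before or after ranking; this sketch assumes ranking by a hypothetical `triplet_score` field and filtering afterwards on a hypothetical `predicate_score` field. The example predictions are invented for illustration.

```python
def select_triplets(triplets, top_k=10, min_predicate_score=0.6):
    """Keep the top_k triplets by overall score, then drop weak predicates."""
    ranked = sorted(triplets, key=lambda t: t["triplet_score"], reverse=True)
    return [t for t in ranked[:top_k]
            if t["predicate_score"] > min_predicate_score]

# Hypothetical predictions for one image.
predictions = [
    {"subject": "sheep", "predicate": "standing on", "object": "grass",
     "triplet_score": 0.9, "predicate_score": 0.8},
    {"subject": "sheep", "predicate": "beside", "object": "fence",
     "triplet_score": 0.7, "predicate_score": 0.5},
    {"subject": "tree", "predicate": "behind", "object": "fence",
     "triplet_score": 0.4, "predicate_score": 0.7},
]
selected = select_triplets(predictions)
```

Under these assumptions, the "beside" triplet is dropped because its predicate score does not exceed 0.6, even though it ranks second overall.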