VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation

Zijian Zhou, Miaojing Shi, Holger Caesar

TL;DR

VLPrompt addresses PSG by integrating vision-derived features with language-derived knowledge from LLMs through carefully designed prompts. It introduces RP- and RJ-prompts to produce complementary language embeddings and employs two decoders that interact with vision features via cross-attention, followed by a gating fusion to predict relations. The approach yields large gains on PSG datasets, particularly for rare relations, and demonstrates the value of language grounding in structured scene understanding. By leveraging pre-extracted LLM descriptions and end-to-end training of the prompter, VLPrompt offers a practical path to more accurate, language-informed scene graphs with broad downstream impact.
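
The cross-attention and gating-fusion steps mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not give the gate's form, so a scalar sigmoid gate over the concatenated decoder outputs is assumed here, and treating each language embedding as the attention query over the vision features is also an assumption. All names (`cross_attention`, `gated_fusion`, `gate_weights`, `gate_bias`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(a, b, gate_weights, gate_bias):
    """Blend two vectors with a scalar sigmoid gate (assumed form, not from the paper)."""
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, a + b)) + gate_bias)
    return [g * x + (1.0 - g) * y for x, y in zip(a, b)]


# Toy inputs: two vision feature vectors and one embedding per prompt type.
vision_features = [[1.0, 0.0], [0.0, 1.0]]
rp_embedding = [1.0, 0.0]  # stands in for an RP-prompt language embedding
rj_embedding = [0.0, 1.0]  # stands in for an RJ-prompt language embedding

# Each "decoder" is reduced to one cross-attention step over the vision features.
rp_out = cross_attention(rp_embedding, vision_features, vision_features)
rj_out = cross_attention(rj_embedding, vision_features, vision_features)

# With zero gate weights and zero bias the gate is 0.5, so the outputs are averaged.
fused = gated_fusion(rp_out, rj_out, [0.0, 0.0, 0.0, 0.0], 0.0)
```

In a trained model the gate parameters would be learned, letting the network decide per relation how much to rely on each decoder's output.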

Abstract

Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image understanding by simultaneously segmenting objects and predicting relations among objects. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging the recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on an attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, proving the effectiveness of incorporating language information and alleviating the long-tail problem of relations. Code is available at https://github.com/franciszzj/TP-SIS.

Paper Structure (30 sections, 3 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Comparison between previous PSG methods and ours. Left: Images of "person cleaning elephant" in two different scenes, accompanied by snippets of descriptions about "person cleaning elephant" obtained from LLMs. Right: Previous vision-only models can predict the cleaning relation between the person and elephant in image 1, but often classify image 2's relation as riding due to the person's position on the back of the elephant. Our vision-language model, enriched with language information, precisely identifies the cleaning relation in both images.
  • Figure 2: The overall framework of VLPrompt, which comprises three components: the vision feature extractor, the language feature extractor and the vision-language prompter.
  • Figure 3: Visualization results of our VLPrompt. We show two examples. For each example, the top left displays the predicted segmentation results, the top right shows the top 10 predicted relation triplets (all of which are correct), and the bottom shows the language snippet used to predict the triplets highlighted in yellow.
  • Figure 4: Visualization results of our VLPrompt.
  • Figure 5: Visualization results of our VLPrompt.
  • ...and 4 more figures