VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation
Zijian Zhou, Miaojing Shi, Holger Caesar
TL;DR
VLPrompt addresses Panoptic Scene Graph Generation (PSG) by combining vision features extracted from images with language knowledge obtained from LLMs through carefully designed prompts. It introduces RP- and RJ-prompts to produce complementary language embeddings and employs two decoders that interact with vision features via cross-attention, followed by a gating fusion to predict relations. The approach yields large gains on the PSG dataset, particularly for rare relations, and demonstrates the value of language information for structured scene understanding. Because the LLM descriptions are pre-extracted and the prompter is trained end-to-end, VLPrompt offers a practical path to more accurate, language-informed scene graphs.
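The two-branch design summarized above (cross-attention against each set of language embeddings, then a gated fusion) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the vision features of subject-object pairs act as queries over the language embeddings, that attention is single-head with no learned projections, and that the gate is a fixed sigmoid rather than a learned layer. All names (`cross_attention`, `gated_fusion`, `vision_pairs`, `rp_lang`, `rj_lang`, `gate_logits`) and all numbers are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned projections, and the same vectors serve as both
    keys and values. Each query returns a weighted average of keys_values.
    """
    scale = math.sqrt(len(keys_values[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        outputs.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(len(q))
        ])
    return outputs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(a, b, gate_logits):
    """Elementwise fusion g * a + (1 - g) * b with g = sigmoid(gate_logit).

    Assumption: in a trained model the gate logits would come from a learned
    layer; here they are supplied by hand.
    """
    return [
        [sigmoid(g) * x + (1.0 - sigmoid(g)) * y for x, y, g in zip(ra, rb, rg)]
        for ra, rb, rg in zip(a, b, gate_logits)
    ]


# Hypothetical toy inputs: two subject-object pair features of dimension 2.
vision_pairs = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical language embeddings for the two prompt types.
rp_lang = [[0.5, 0.5], [1.0, -1.0]]
rj_lang = [[0.2, 0.8]]

rp_out = cross_attention(vision_pairs, rp_lang)  # RP-prompt branch
rj_out = cross_attention(vision_pairs, rj_lang)  # RJ-prompt branch

# Zero logits give g = 0.5, so both branches contribute equally.
gate_logits = [[0.0, 0.0], [0.0, 0.0]]
fused = gated_fusion(rp_out, rj_out, gate_logits)
```

In a full model the fused features would feed a relation classifier. The gate lets each feature dimension weight one branch over the other, which is one plausible reading of "gating fusion" in the summary.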
Abstract
Panoptic Scene Graph Generation (PSG) aims to achieve comprehensive image understanding by simultaneously segmenting objects and predicting the relations among them. However, the long-tail distribution of relations leads to unsatisfactory results in real-world applications. Prior methods rely predominantly on vision information or use only limited language information, such as object or relation names, thereby underexploiting language information. Leveraging recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. A prompter network based on an attention mechanism then combines the two to achieve precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, demonstrating the effectiveness of incorporating language information and alleviating the long-tail problem of relations. Code is available at https://github.com/franciszzj/TP-SIS.
