Scene Graph Generation with Role-Playing Large Language Models

Guikun Chen, Jin Li, Wenguan Wang

TL;DR

SDSGG is an OVSGG framework based on scene-specific descriptions: the weights of its text classifiers are adaptively adjusted according to the visual content, and a renormalization mechanism adjusts the influence of each text classifier based on its relevance to the presented scene.

Abstract

Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP and follow a standard zero-shot pipeline: computing the similarity between the query image and the text embeddings for each category (i.e., text classifiers). In this work, we argue that the text classifiers adopted by existing OVSGG methods, i.e., category-/part-level prompts, are scene-agnostic, as they remain unchanged across contexts. Such fixed text classifiers not only struggle to model visual relations with high variance, but also fail to adapt to distinct contexts. To address these intrinsic shortcomings, we devise SDSGG, an OVSGG framework based on scene-specific descriptions, where the weights of text classifiers are adaptively adjusted according to the visual content. In particular, to generate comprehensive and diverse descriptions oriented to the scene, an LLM is asked to play different roles (e.g., biologist and engineer) to analyze and discuss the descriptive features of a given scene from different views. Unlike previous efforts that simply treat the generated descriptions as mutually equivalent text classifiers, SDSGG is equipped with a renormalization mechanism that adjusts the influence of each text classifier based on its relevance to the presented scene (this is what the term "specific" means). Furthermore, to capture the complicated interplay between subjects and objects, we propose a new lightweight module called the mutual visual adapter. It refines CLIP's ability to recognize relations by learning an interaction-aware semantic space. Extensive experiments on prevalent benchmarks show that SDSGG outperforms leading methods by a clear margin.
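The standard zero-shot pipeline mentioned above can be illustrated with a minimal sketch. It assumes the image and text embeddings have already been computed; the three-dimensional vectors and the predicate names below are hypothetical stand-ins for real CLIP outputs, not values from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(image_emb, text_classifiers):
    """Score every category by image-text similarity and pick the best one.

    text_classifiers maps a category name to its (fixed) text embedding,
    which is what makes this kind of classifier scene-agnostic.
    """
    scores = {name: cosine(image_emb, emb) for name, emb in text_classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy embeddings (assumption: stand-ins for CLIP text features).
text_classifiers = {
    "riding": [0.9, 0.1, 0.0],
    "holding": [0.1, 0.9, 0.0],
    "standing on": [0.0, 0.2, 0.9],
}
# Hypothetical toy image embedding (assumption: stand-in for a CLIP image feature).
image_emb = [0.8, 0.3, 0.1]

best, scores = zero_shot_classify(image_emb, text_classifiers)
print(best, scores)
```

Because the text embeddings never change, the same classifier is applied to every scene; this is the limitation the abstract attributes to existing methods.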

Paper Structure

This paper contains 19 sections, 8 equations, 9 figures, 15 tables, 2 algorithms.

Figures (9)

  • Figure 1: Illustration of the used text classifiers in OVSGG. (a) CLIP performs zero-shot classification by computing similarity between the query image and the text embeddings for each category, then choosing the highest. (b) To further utilize the learned semantic space of CLIP, one can compute similarities of multiple part-level prompts (e.g., the object of $\langle$man, riding, horse$\rangle$ may be described with “with four legs” and “with a saddle”). (c) Instead of using these scene-agnostic text classifiers, SDSGG adopts comprehensive, scene-specific descriptions generated by LLMs, which can adapt to specific contexts by using the proposed renormalization.
  • Figure 2: (a) Overview of SDSGG. (b) Each text classifier of SDSGG contains a raw description $\bm{d}_a^n$ and an opposite description $\bm{d}_p^n$. As such, the self-normalized similarities can be computed with the association ($C_{r}^n$) between predicate categories and scene-specific descriptions (SSDs). (c) Given the visual features (i.e., $\bm{f}_{s}^{img}$, $\bm{f}_{o}^{cls}$, and $\bm{f}_{o}^{img}$) of both the subject and object extracted from CLIP's visual encoder, our mutual visual adapter (MVA) projects them into an interaction-aware space and models their complicated interplay with cross-attention.
  • Figure 3: Visual results on VG (Krishna et al., 2017).
  • Figure S1: Illustration of the generated scene-specific descriptions.
  • Figure S2: Prompts for initial description generation.
  • ...and 4 more figures
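
The Figure 2(b) caption pairs each raw description with an opposite description and combines the resulting self-normalized similarities through an association between predicates and descriptions. The exact formulas are not given in this excerpt, so the sketch below rests on two assumptions: that the self-normalized similarity is the similarity to the raw description minus the similarity to the opposite one, and that a predicate's score is the association-weighted sum of those values. All embeddings, predicate names, and association values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def self_normalized_similarity(visual_emb, raw_emb, opposite_emb):
    """Assumed form: sim(visual, raw) - sim(visual, opposite).

    A value near zero means the description is uninformative for this scene.
    """
    return cosine(visual_emb, raw_emb) - cosine(visual_emb, opposite_emb)


def predicate_scores(visual_emb, descriptions, association):
    """Assumed form: weight each description's similarity by its association.

    descriptions: list of (raw_emb, opposite_emb) pairs, one per description.
    association:  maps a predicate to one weight per description.
    """
    sims = [
        self_normalized_similarity(visual_emb, raw, opp)
        for raw, opp in descriptions
    ]
    return {
        pred: sum(c * s for c, s in zip(weights, sims))
        for pred, weights in association.items()
    }


# Hypothetical toy data (assumption: stand-ins for CLIP features).
descriptions = [
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # description 0: raw, opposite
    ([0.0, 0.0, 1.0], [0.0, 1.0, 0.0]),  # description 1: raw, opposite
]
# Hypothetical association values: +1 supports, 0 is neutral, -1 contradicts.
association = {
    "riding": [1, 0],
    "lying on": [-1, 1],
}
visual_emb = [0.9, 0.1, 0.2]

scores = predicate_scores(visual_emb, descriptions, association)
print(scores)
```

Under these assumptions, a description that matches the scene as well as its opposite contributes nothing to any predicate, which is one simple way to let relevance to the scene control a classifier's influence.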