Decomposed Prototype Learning for Few-Shot Scene Graph Generation

Xingchen Li, Jun Xiao, Guikun Chen, Yinfu Feng, Yi Yang, An-an Liu, Long Chen

TL;DR

This work targets Few-Shot Scene Graph Generation by addressing the strong intra-class variance of predicates through Decomposed Prototype Learning (DPL). DPL builds a decomposed prototype space for each predicate using subject- and object-aware prototypes and learns query-adaptive representations via learnable prompts in vision-language models. It combines a Query Embedding Network and a Predicate Prototype Network to produce multiple prototypes per predicate, aggregating them with query-context-aware weights, and uses metric learning to classify relations. Extensive experiments on VG-25, VG-60, and GQA-50 show state-of-the-art performance on unseen predicates with good transfer, while analyses reveal the benefits of multiple prototypes, learnable prompts, and reweighting for robust few-shot generalization.

Abstract

Today's scene graph generation (SGG) models typically require abundant manual annotations to learn new predicate types. It is therefore difficult to apply them to real-world applications with a massive number of uncommon predicate categories whose annotations are hard to collect. In this paper, we focus on Few-Shot SGG (FSSGG), which requires SGG models to quickly transfer previous knowledge and recognize unseen predicates from only a few examples. However, current methods for FSSGG are hindered by the high intra-class variance of predicate categories in SGG. On the one hand, each predicate category commonly has multiple semantic meanings under different contexts. On the other hand, the visual appearance of relation triplets with the same predicate differs greatly under different subject-object compositions. Such high input variance makes it hard to learn a generalizable representation for each predicate category with current few-shot learning (FSL) methods. We find, however, that this intra-class variance of predicates is highly related to the composed subjects and objects. To model the intra-class variance of predicates with subject-object context, we propose a novel Decomposed Prototype Learning (DPL) model for FSSGG. Specifically, we first construct a decomposable prototype space that captures the diverse semantics and visual patterns of subjects and objects for predicates by decomposing them into multiple prototypes. We then integrate these prototypes with different weights to generate a query-adaptive predicate representation with more reliable semantics for each query sample. We conduct extensive experiments and compare with various baseline methods to show the effectiveness of our method.
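
The weighting-and-matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity, the softmax temperature, the function names, and the toy vectors are all assumptions made for illustration, and the learned embedding networks and prompts are replaced by hand-written vectors. Each support sample contributes one prototype together with a subject-object context vector; the prototypes of a predicate are combined with weights that depend on how close each support context is to the query context, and the query is assigned to the predicate whose combined prototype is nearest.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_adaptive_prototype(query_context, samples, temperature=0.1):
    """Combine per-sample prototypes, weighted by context similarity to the query.

    samples: list of (context_vector, prototype_vector) pairs for one predicate.
    The weighting rule (softmax over cosine similarity) is an assumption.
    """
    weights = softmax([cosine(query_context, ctx) for ctx, _ in samples], temperature)
    dim = len(samples[0][1])
    prototype = [sum(w * proto[i] for w, (_, proto) in zip(weights, samples))
                 for i in range(dim)]
    return prototype, weights


def classify(query_embedding, query_context, support_set, temperature=0.1):
    """Return the predicate whose query-adaptive prototype is closest to the query."""
    scores = {}
    for predicate, samples in support_set.items():
        prototype, _ = query_adaptive_prototype(query_context, samples, temperature)
        scores[predicate] = cosine(query_embedding, prototype)
    return max(scores, key=scores.get), scores


# Toy 2-shot support set (hypothetical vectors): "riding" has two visually
# different prototypes, one per subject-object context.
support_set = {
    "riding": [([1.0, 0.0], [1.0, 0.0, 0.0]),
               ([0.0, 1.0], [0.0, 1.0, 0.0])],
    "holding": [([1.0, 0.0], [0.0, 0.0, 1.0]),
                ([0.0, 1.0], [0.0, 0.0, 1.0])],
}
query_context = [1.0, 0.0]          # context matching the first support sample
query_embedding = [0.9, 0.1, 0.0]   # relation embedding of the query pair

label, scores = classify(query_embedding, query_context, support_set)
_, riding_weights = query_adaptive_prototype(query_context, support_set["riding"])
print(label, riding_weights)
```

In this toy setting the support sample whose context matches the query receives almost all of the weight, so the combined prototype stays close to the matching visual pattern instead of being averaged with an unrelated one.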

Paper Structure

This paper contains 41 sections, 18 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Illustration of 3-shot FSSGG. For each predicate category, 3 support samples are provided with annotations (i.e., the bounding boxes, the categories of subjects and objects, and their predicate categories). The goal of FSSGG models is to detect the relation triplets in the query images.
  • Figure 2: (a) The prototype representation for each class in conventional few-shot image classification. (b) Due to the high intra-class variance in FSSGG, the predicate may have multiple prototypes in the latent space.
  • Figure 3: Overview of our Decomposed Prototype Learning Network. The framework consists of two main networks: 1) the Query Embedding Network (QEN), which generates embeddings of query samples, and 2) the Predicate Prototype Network (PPN), which generates a query-adaptive predicate prototype representation of each target category for each input query sample. We then estimate the distance between them with a metric learning method to perform classification.
  • Figure 4: Visualization of the weights assigned to the prototypes, based on $5$-shot support samples. The first column displays the query triplets and the target predicates (highlighted in green). The other columns display the support triplets and the weights DPL assigns to them. Subjects are drawn in red boxes and objects in blue boxes.