RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning

Kanghoon Yoon, Kibum Kim, Jaehyung Jeon, Yeonjun In, Donghyun Kim, Chanyoung Park

TL;DR

This work tackles the long-tailed and semantically ambiguous nature of scene graph generation by reframing SGG as multi-label classification with partial annotation. It introduces RA-SGG, a retrieval-augmented framework that uses a memory bank of relation embeddings to retrieve semantically similar instances, identify latent fine-grained predicates, and augment labels through inverse-propensity sampling paired with multi-prototype learning. The method demonstrates state-of-the-art improvements on Visual Genome and GQA, especially in tail-class performance as measured by $F@K$ and $mR@K$, while maintaining strong overall accuracy. The approach offers a practical path to richer, more nuanced scene graphs and highlights the value of memory-augmented, label-aware learning in structured vision tasks.

Abstract

Scene Graph Generation (SGG) research has suffered from two fundamental challenges: the long-tailed predicate distribution and semantic ambiguity between predicates. These challenges bias SGG models towards head predicates, favoring dominant general predicates while overlooking fine-grained ones. In this paper, we address these challenges by framing SGG as a multi-label classification problem with partial annotation, where relevant labels of fine-grained predicates are missing. Under this framing, we propose Retrieval-Augmented Scene Graph Generation (RA-SGG), which identifies instances that are potentially multi-labeled and enriches each single label with multi-labels that are semantically similar to the original label by retrieving relevant samples from an established memory bank. Based on the augmented relations (i.e., the discovered multi-labels), we apply multi-prototype learning to train our SGG model. Comprehensive experiments demonstrate that RA-SGG outperforms state-of-the-art baselines by up to 3.6% on VG and 5.9% on GQA, particularly in terms of F@K, showing that RA-SGG effectively alleviates the biased prediction caused by the long-tailed distribution and semantic ambiguity of predicates.
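
The abstract reports its gains in terms of F@K. As a minimal sketch, assuming the convention common in the SGG literature that F@K is the harmonic mean of R@K (overall recall) and mR@K (recall averaged over predicate classes), the metric could be computed as follows. The function names and toy numbers are hypothetical and are not taken from the paper's evaluation code.

```python
def mean_recall(per_predicate_recall: dict) -> float:
    """mR@K: unweighted average of recall over predicate classes (assumed definition)."""
    return sum(per_predicate_recall.values()) / len(per_predicate_recall)


def f_at_k(recall_at_k: float, mean_recall_at_k: float) -> float:
    """F@K: harmonic mean of R@K and mR@K (assumed definition)."""
    total = recall_at_k + mean_recall_at_k
    if total == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * recall_at_k * mean_recall_at_k / total


# Hypothetical per-predicate recalls: a head predicate and two tail predicates.
toy_recalls = {"on": 0.75, "standing on": 0.5, "parked on": 0.25}
mr = mean_recall(toy_recalls)
print(mr, f_at_k(0.6, mr))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model that scores high on R@K by predicting only head predicates is penalized through its low mR@K, which is why the metric suits the long-tailed setting described above.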

Paper Structure

This paper contains 22 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of the label augmentation of RA-SGG in SGG. (a) Examples of the query relation instances and their retrieved relation instances in the embedding space. "GT" denotes the ground truth. (b) The ratio of retrieved instances that have the same/different predicate as the query relation in the training data. (c) Fine-grained predicate label augmentation.
  • Figure 2: RA-SGG first uses the relation embeddings for querying instances in the memory bank to retrieve visually and semantically similar instances. Then, it augments multi-labels by assigning pseudo-labels to potentially multi-labeled instances.
  • Figure 3: Per-predicate comparison of RA-SGG with PE-Net on VG. The task is PredCls.
  • Figure 4: Per-predicate comparison of RA-SGG with PE-Net on GQA. The task is PredCls. The predicates are sorted by frequency.
  • Figure 5: Sensitivity analysis on hyperparameters. The dotted line represents the result of PE-Net.
  • ...and 2 more figures
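
The retrieve-then-augment flow described in the Figure 2 caption can be illustrated with a rough sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: cosine similarity as the retrieval score, a fixed top-k, inverse class frequency as the inverse-propensity sampling weight, and all names (`retrieve_top_k`, `augment_label`, `class_freq`). Multi-prototype learning itself is not shown.

```python
import numpy as np


def retrieve_top_k(query, bank_embs, k):
    """Indices of the k memory-bank embeddings most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q
    return np.argsort(-sims)[:k]


def augment_label(query, label, bank_embs, bank_labels, class_freq, k=3, rng=None):
    """Return a multi-hot target: the original label plus at most one retrieved label.

    Assumption: the extra label is sampled among retrieved predicates that differ
    from the original one, with probability proportional to 1 / class frequency,
    so that rare (fine-grained) predicates are favored.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    target = np.zeros(len(class_freq))
    target[label] = 1.0

    neighbours = retrieve_top_k(query, bank_embs, k)
    candidates = np.unique(bank_labels[neighbours])
    candidates = candidates[candidates != label]
    if candidates.size == 0:
        return target  # nothing new retrieved: keep the single label

    weights = 1.0 / class_freq[candidates]
    extra = rng.choice(candidates, p=weights / weights.sum())
    target[extra] = 1.0
    return target


# Toy memory bank: three relation embeddings with predicate ids 0, 1, 2.
bank_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
bank_labels = np.array([0, 1, 2])
class_freq = np.array([100.0, 5.0, 10.0])  # predicate 0 is the head class
target = augment_label(np.array([1.0, 0.05]), 0, bank_embs, bank_labels, class_freq, k=2)
print(target)
```

In this toy setup the query annotated with the head predicate 0 retrieves a close neighbour labeled with the rarer predicate 1, so the target becomes multi-hot over both. This mirrors the idea of enriching a general label with a fine-grained one.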