RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning
Kanghoon Yoon, Kibum Kim, Jaehyung Jeon, Yeonjun In, Donghyun Kim, Chanyoung Park
TL;DR
This work tackles the long-tailed and semantically ambiguous nature of scene graph generation by reframing SGG as multi-label classification with partial annotation. It introduces RA-SGG, a retrieval-augmented framework that uses a memory bank of relation embeddings to retrieve semantically similar instances, identify latent fine-grained predicates, and augment labels through inverse-propensity sampling paired with multi-prototype learning. The method improves over state-of-the-art baselines on Visual Genome and GQA, especially on tail classes as measured by $F@K$ and $mR@K$, while maintaining strong overall accuracy. The approach offers a practical path to richer, more nuanced scene graphs and highlights the value of memory-augmented, label-aware learning in structured vision tasks.
Abstract
Scene Graph Generation (SGG) research has suffered from two fundamental challenges: the long-tailed predicate distribution and semantic ambiguity between predicates. These challenges bias SGG models towards head predicates, favoring dominant general predicates while overlooking fine-grained ones. In this paper, we address these challenges by framing SGG as a multi-label classification problem with partial annotation, where relevant labels of fine-grained predicates are missing. Under this framing, we propose Retrieval-Augmented Scene Graph Generation (RA-SGG), which identifies instances that are likely to admit multiple labels and enriches each original single label with semantically similar labels by retrieving relevant samples from our established memory bank. Based on the augmented relations (i.e., the discovered multi-labels), we apply multi-prototype learning to train our SGG model. Comprehensive experiments demonstrate that RA-SGG outperforms state-of-the-art baselines by up to 3.6% on VG and 5.9% on GQA, particularly in terms of F@K, showing that RA-SGG effectively alleviates the biased prediction caused by the long-tailed distribution and semantic ambiguity of predicates.
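
To make the retrieve-then-augment idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes relation embeddings are plain vectors, that retrieval is a cosine-similarity top-k lookup, and that one extra label is sampled with probability inversely proportional to class frequency. The function names (`retrieve_candidate_labels`, `inverse_propensity_weights`, `augment_label`) and the parameter `k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def retrieve_candidate_labels(query, bank_emb, bank_labels, k=5):
    """Return the labels of the k memory-bank entries most similar to `query`.

    Assumption: similarity is cosine similarity over raw embedding vectors.
    """
    q = query / np.linalg.norm(query)
    b = bank_emb / np.linalg.norm(bank_emb, axis=1, keepdims=True)
    sims = b @ q                      # cosine similarity to every bank entry
    top = np.argsort(-sims)[:k]       # indices of the k largest similarities
    return bank_labels[top]


def inverse_propensity_weights(class_counts):
    """Weights proportional to 1 / class frequency, normalized to sum to 1."""
    freq = class_counts / class_counts.sum()
    w = 1.0 / freq
    return w / w.sum()


def augment_label(orig_label, retrieved_labels, ip_weights, rng):
    """Add at most one retrieved label to the original one.

    Assumption: the extra label is drawn among retrieved labels that differ
    from the original, with rarer (tail) predicates more likely to be picked.
    """
    candidates = np.unique(retrieved_labels[retrieved_labels != orig_label])
    if candidates.size == 0:
        return [int(orig_label)]      # nothing new retrieved: keep single label
    p = ip_weights[candidates]
    p = p / p.sum()
    extra = rng.choice(candidates, p=p)
    return [int(orig_label), int(extra)]


# Toy usage: three predicate classes, class 0 is the head, class 2 the tail.
bank_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
bank_labels = np.array([0, 2, 1])
weights = inverse_propensity_weights(np.array([100.0, 10.0, 1.0]))
retrieved = retrieve_candidate_labels(np.array([1.0, 0.0]), bank_emb, bank_labels, k=2)
multi_label = augment_label(0, retrieved, weights, np.random.default_rng(0))
```

In this toy setup the query annotated with the head predicate retrieves a neighbor labeled with a tail predicate, which then becomes an additional target. How RA-SGG actually selects instances, builds the memory bank, and feeds the resulting multi-labels into multi-prototype learning is specified in the paper itself, not in this sketch.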
