Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency
Hyeongjin Kim, Sangwon Kim, Dasom Ahn, Jong Taek Lee, Byoung Chul Ko
TL;DR
This work tackles two key challenges in scene graph generation (SGG): modeling object co-occurrence and mitigating long-tail bias in predicate distributions. It introduces CooK, a co-occurrence knowledge matrix learned from data, and a learnable TF-$l$-IDF layer that reweights and updates node features to balance head and tail classes. Both components can be integrated into existing SGG models based on message-passing neural networks (MPNNs). Experiments on Visual Genome (VG) and Open Images (OI) show consistent improvements across the PredCls, SGCls, and SGGen subtasks, with notable gains in long-tail scenarios and when the two components are combined (CooK + TF-$l$-IDF). The approach generalizes across backbone models and datasets, which suggests it could support more robust and less biased scene understanding.
Abstract
Scene graph generation (SGG) is an important task in image understanding because it represents the relationships between objects in an image as a graph structure, making it possible to understand the semantic relationships between objects intuitively. Previous SGG studies used message-passing neural networks (MPNNs) to update features, which can effectively reflect information about surrounding objects. However, these studies did not account for the co-occurrence of objects during scene graph generation. In addition, they addressed the long-tail problem of the training dataset only from the perspectives of sampling and learning methods. To address these two problems, we propose CooK, which reflects the Co-occurrence Knowledge between objects, and the learnable term frequency-inverse document frequency (TF-$l$-IDF) to solve the long-tail problem. We applied the proposed model to the SGG benchmark dataset, and the results showed a performance improvement of up to 3.8% over existing state-of-the-art models in the SGGen subtask. The results also indicate that the proposed method generalizes well, showing uniform performance improvement for all MPNN models.
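The abstract does not specify how CooK or TF-$l$-IDF are computed, so the following is only a rough illustration of the two underlying ideas, not the authors' method. It counts object co-occurrences from per-image label lists and computes classic TF-IDF weights with a fixed logarithmic IDF term, which is the part a learnable variant would presumably replace with trained parameters. The function names and the toy annotations are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(images):
    """Count how often each unordered pair of object labels shares an image."""
    counts = Counter()
    for labels in images:
        # Use the set of labels so duplicates within one image count once.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


def tf_idf(images):
    """Classic TF-IDF per image, treating each image as a document of labels.

    The IDF term here is the fixed log(N / df); this is the standard
    formulation, not the learnable one proposed in the paper.
    """
    n_docs = len(images)
    doc_freq = Counter()
    for labels in images:
        doc_freq.update(set(labels))

    scores = []
    for labels in images:
        tf = Counter(labels)
        total = len(labels)
        scores.append({
            label: (count / total) * math.log(n_docs / doc_freq[label])
            for label, count in tf.items()
        })
    return scores


# Toy, made-up annotations (hypothetical, for illustration only).
images = [
    ["person", "horse", "hat"],
    ["person", "person", "bench"],
    ["person", "horse"],
]

cooc = cooccurrence_counts(images)
weights = tf_idf(images)
```

In this toy data, "person" appears in every image, so its fixed IDF is zero and it receives no weight, while rarer labels such as "hat" and "bench" are weighted up. That rebalancing between frequent and rare classes is the intuition behind applying a TF-IDF-style weighting to long-tailed data.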
