Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency

Hyeongjin Kim; Sangwon Kim; Dasom Ahn; Jong Taek Lee; Byoung Chul Ko

TL;DR

This work tackles two key challenges in scene graph generation: modeling object co-occurrence and mitigating long-tail bias in predicate distributions. It introduces CooK, a co-occurrence knowledge matrix derived from the training data, and a learnable TF-$l$-IDF layer that reweights and updates node features to balance head and tail classes. Both components can be integrated into existing MPNN-based SGG models. Experiments on the VG and OI datasets show consistent improvements across the PredCls, SGCls, and SGGen subtasks, with notable gains in long-tail scenarios and when both components are combined (CooK + TF-$l$-IDF). The approach generalizes across backbone models and datasets, suggesting practical value for more robust and less biased scene understanding in vision-language tasks.

Abstract

Scene graph generation (SGG) is an important task in image understanding because it represents the relationships between objects in an image as a graph structure, making it possible to understand the semantic relationships between objects intuitively. Previous SGG studies used message-passing neural networks (MPNNs) to update features, which can effectively reflect information about surrounding objects. However, these studies have failed to reflect the co-occurrence of objects during scene graph generation. In addition, they addressed the long-tail problem of the training dataset only from the perspectives of sampling and learning methods. To address these two problems, we propose CooK, which reflects the Co-occurrence Knowledge between objects, and the learnable term frequency-inverse document frequency (TF-$l$-IDF) to solve the long-tail problem. We applied the proposed model to the SGG benchmark dataset, and the results showed a performance improvement of up to 3.8% compared with existing state-of-the-art models in the SGGen subtask. The results also indicate that the proposed method generalizes well, showing uniform performance improvement for all MPNN models.
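The abstract does not spell out how the co-occurrence knowledge is computed, so the following is only a minimal sketch of one plausible reading: estimating a conditional co-occurrence value for class $c_j$ given class $c_i$ by counting training images. The function name `cooccurrence_knowledge`, the image-level counting scheme, and the toy labels are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_knowledge(images):
    """Estimate P(c_j present | c_i present) from per-image object labels.

    `images` is a list of label lists, one per training image.
    Counting at the image level is an illustrative assumption,
    not the paper's stated definition of CooK.
    """
    single = Counter()  # number of images that contain class c_i
    pair = Counter()    # number of images that contain both c_i and c_j
    for labels in images:
        present = set(labels)  # ignore repeated instances within an image
        single.update(present)
        pair.update(permutations(present, 2))  # all ordered (c_i, c_j) pairs
    return {(ci, cj): n / single[ci] for (ci, cj), n in pair.items()}


# Hypothetical toy annotations.
toy_images = [
    ["person", "horse", "hat"],
    ["person", "hat"],
    ["person", "dog"],
    ["horse", "grass"],
]
cook = cooccurrence_knowledge(toy_images)
```

In this toy data the estimate is asymmetric: every image with a hat also contains a person, while only two of the three images with a person contain a hat. A conditional form like this is one way to obtain a directed prior between object classes.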

Paper Structure (25 sections, 13 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Related Work
SGG Approaches
Long-Tail Problem Solving
Label Correlation
CooK + TF-$l$-IDF Recipe
Preliminaries
Co-occurrence Knowledge
Learnable TF-$l$-IDF Layer
Training Strategy
Inference
Experiment
Datasets
Evaluation Metrics
Implementation Details
...and 10 more sections

Figures (7)

Figure 1: A novel learning recipe for SGG. (a) shows the class distribution and scene graph results of SGG performed using a conventional MPNN-based method. The proposed CooK + TF-$l$-IDF layer can be easily applied to existing MPNN-based models, as shown in (b). By updating the features according to the knowledge of object co-occurrence and the label inverse frequency, as shown in (c), it is possible to generate accurate relations between objects and successfully alleviate the long-tail problem.
Figure 2: Overall training strategy of the proposed CooK + TF-$l$-IDF method. (a) In the MPNN process, the prior knowledge value $CooK(c_j | c_i)$, extracted from the training data, lets message passing reflect object co-occurrence. The resulting first-order node feature $n'$ is the input to (b) the TF-$l$-IDF layer, which updates features according to label frequencies and produces the second-order node feature $\hat{n}$. After $L$ iterations of (a) and (b), $\hat{n}$ is passed to the scene graph predictor in (c) to generate the final scene graph.
Figure 3: TF-$l$-IDF performance for different batch sizes. Because the proposed TF-$l$-IDF is computed per batch, performance increases as the batch size increases.
Figure 4: Effect of TF-$l$-IDF on the long-tail problem. The proposed TF-$l$-IDF reduces mR@100 for common labels in the head group and shifts the focus toward rarer labels in the body and tail groups.
Figure 5: Visualization of the CooK matrix, which reflects object co-occurrence in the VG dataset.
...and 2 more figures
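
The captions for Figures 2 and 3 state that TF-$l$-IDF updates features using label frequencies and is computed per batch, but the layer itself is not given here. As a rough point of reference only, the sketch below computes classical (non-learnable) TF-IDF weights over a batch, treating each image as a document and each object label as a term. The function name `batch_tf_idf`, the document/term mapping, and the toy batch are assumptions. The learnable part of the paper's layer is not reproduced.

```python
import math
from collections import Counter


def batch_tf_idf(batch):
    """Classical TF-IDF over one batch of per-image label lists.

    Assumption for illustration: image = document, label = term.
    This is plain TF-IDF, not the paper's learnable TF-l-IDF layer.
    """
    n_docs = len(batch)
    doc_freq = Counter()  # number of images in the batch containing each label
    for labels in batch:
        doc_freq.update(set(labels))

    weights = []
    for labels in batch:
        counts = Counter(labels)
        total = len(labels)
        weights.append({
            # term frequency within the image times inverse document frequency
            label: (c / total) * math.log(n_docs / doc_freq[label])
            for label, c in counts.items()
        })
    return weights


# Hypothetical toy batch of three images.
toy_batch = [
    ["person", "person", "hat"],
    ["person", "dog"],
    ["person", "kite"],
]
w = batch_tf_idf(toy_batch)
```

In this toy batch, "person" appears in every image and so receives zero weight, while labels that occur in a single image receive positive weight. Since the document frequencies are estimated from the batch, the weights depend on which images, and how many, the batch contains.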
