Adaptive Self-training Framework for Fine-grained Scene Graph Generation

Kibum Kim; Kanghoon Yoon; Yeonjun In; Jinyoung Moon; Donghyun Kim; Chanyoung Park

TL;DR

This work tackles long-tailed and sparsely annotated scene graph data by introducing ST-SGG, a model-agnostic self-training framework that exploits unannotated triplets through iterative pseudo-labeling. ST-SGG has two central components. Class-specific Adaptive Thresholding with Momentum (CATM) maintains a per-class threshold, updated with an exponential moving average (EMA) and class-specific momentum, to balance learning across predicate classes. A graph structure learner (GSL) enriches the scene graph structure so that MPNN-based SGG models produce more confident pseudo-labels. Empirical results on Visual Genome and Open Images V6 show substantial gains in fine-grained predicate metrics (mR@K, F@K) while preserving performance on head predicates; ablations confirm the importance of EMA, class-specific momentum, and graph structure learning. Overall, ST-SGG provides a scalable, effective approach to mitigating missing annotations and long-tail biases in SGG, with practical impact for downstream scene understanding tasks and improved zero-shot generalization.

Abstract

Scene graph generation (SGG) models suffer from problems inherent to the benchmark datasets, such as the long-tailed predicate distribution and missing annotations. In this work, we aim to alleviate the long-tailed problem of SGG by utilizing unannotated triplets. To this end, we introduce a Self-Training framework for SGG (ST-SGG) that assigns pseudo-labels to unannotated triplets, on which the SGG models are then trained. While there has been significant progress in self-training for image recognition, designing a self-training framework for the SGG task is more challenging because of its inherent characteristics, such as the semantic ambiguity and long-tailed distribution of predicate classes. Hence, we propose a novel pseudo-labeling technique for SGG, called Class-specific Adaptive Thresholding with Momentum (CATM), which is model-agnostic and can be applied to any existing SGG model. Furthermore, we devise a graph structure learner (GSL) that is beneficial when applying our proposed self-training framework to state-of-the-art message-passing neural network (MPNN)-based SGG models. Our extensive experiments verify the effectiveness of ST-SGG on various SGG models, particularly in enhancing performance on fine-grained predicate classes.
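
To make the idea of class-specific thresholds with momentum concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each predicate class keeps its own confidence threshold, that the threshold is moved by an EMA toward the mean confidence of the predictions made for that class in a batch, and that a class with no predictions keeps its threshold unchanged. The class name `AdaptiveThresholds`, the `momentum` and `init` parameters, and the toy probabilities are all hypothetical.

```python
from typing import List, Optional, Sequence


class AdaptiveThresholds:
    """Per-class confidence thresholds updated with an EMA (illustrative only)."""

    def __init__(self, momentum: Sequence[float], init: float = 0.5) -> None:
        # momentum[c] is a hypothetical class-specific EMA rate for class c.
        self.momentum: List[float] = list(momentum)
        self.tau: List[float] = [init] * len(self.momentum)

    @staticmethod
    def _argmax(probs: Sequence[float]) -> int:
        return max(range(len(probs)), key=lambda i: probs[i])

    def pseudo_label(self, probs: Sequence[float]) -> Optional[int]:
        """Return the predicted class if its confidence clears that class's threshold."""
        c = self._argmax(probs)
        return c if probs[c] >= self.tau[c] else None

    def update(self, batch_probs: Sequence[Sequence[float]]) -> None:
        """Move each threshold toward the mean confidence predicted for its class."""
        sums = [0.0] * len(self.tau)
        counts = [0] * len(self.tau)
        for probs in batch_probs:
            c = self._argmax(probs)
            sums[c] += probs[c]
            counts[c] += 1
        for c, n in enumerate(counts):
            if n == 0:
                continue  # assumption: classes unseen in this batch are left as-is
            m = self.momentum[c]
            self.tau[c] = (1.0 - m) * self.tau[c] + m * (sums[c] / n)


# Toy usage: three predicate classes, class 0 adapts faster than the others.
thresholds = AdaptiveThresholds(momentum=[0.5, 0.1, 0.1], init=0.5)
batch = [[0.8, 0.1, 0.1], [0.9, 0.05, 0.05], [0.3, 0.4, 0.3]]
thresholds.update(batch)

confident = thresholds.pseudo_label([0.2, 0.5, 0.3])  # clears the class-1 threshold
rejected = thresholds.pseudo_label([0.6, 0.3, 0.1])   # below the raised class-0 threshold
```

In this toy run the threshold of the frequently predicted class 0 rises quickly, so a moderately confident class-0 prediction is rejected, while the rarely predicted class 1 keeps a lower threshold and still receives a pseudo-label. That is the balancing effect across predicate classes that per-class thresholds are meant to provide.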

Paper Structure

This paper contains 43 sections, 3 equations, 14 figures, 8 tables, 2 algorithms.

Introduction
Related Work
Self-Training Framework for SGG (ST-SGG)
Preliminaries
Problem Formulation of ST-SGG
Challenges of Self-Training for SGG
Class-specific Adaptive Thresholding with Momentum (CATM)
Class-specific Adaptive Thresholding
Class-specific Momentum
Graph Structure Learner for Confident Pseudo-labels
Experiment
Comparison with Baselines on Visual Genome
Comparison with MPNN-based Models on Visual Genome
Ablation Study on Model Components of ST-SGG
Qualitative Analysis on CATM
...and 28 more sections

Figures (14)

Figure 1: (a) Problems in VG scene graph dataset, and (b) self-training framework for SGG.
Figure 2: (a) Performance of Motif and its self-trained models with various thresholding techniques. (b) The number of pseudo-labels per predicate class when Motif-$\tau_c^{\text{cls}}$ is trained on VG.
Figure 3: Impact of applying GSL on the model confidence of BGNN (Li et al., 2021).
Figure 4: (a) Performance comparison per class. The black line indicates the number of pseudo-labeled instances. (b) Performance comparison on head, body, and tail predicate classes.
Figure 5: (a) Adaptive threshold values over iterations, and (b) examples of pseudo-labels of ST-SGG and IE-Trans assigned to the different cases. Conf. denotes the confidence, i.e., $\hat{q}$.
...and 9 more figures
