Adaptive Self-training Framework for Fine-grained Scene Graph Generation

Kibum Kim, Kanghoon Yoon, Yeonjun In, Jinyoung Moon, Donghyun Kim, Chanyoung Park

TL;DR

This work tackles long-tailed and sparsely annotated scene graph data by introducing ST-SGG, a model-agnostic self-training framework that exploits unannotated triplets through iterative pseudo-labeling. ST-SGG has two components. Class-specific Adaptive Thresholding with Momentum (CATM) maintains a separate confidence threshold for each predicate class, updated by an exponential moving average (EMA) with class-specific momentum, so that learning stays balanced across predicate classes. The graph structure learner (GSL) enriches the scene graph structure so that message-passing neural network (MPNN)-based SGG models assign pseudo-labels more confidently. Experiments on Visual Genome and Open Images V6 show gains on the fine-grained predicate metrics (mR@K, F@K) while preserving performance on head predicates, and ablations confirm the contribution of the EMA, the class-specific momentum, and graph structure learning. Overall, ST-SGG offers a scalable way to mitigate missing annotations and long-tail bias in SGG, with benefits for downstream scene understanding tasks and improved zero-shot generalization.
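As a rough illustration of per-class adaptive thresholding with an EMA, the sketch below keeps one threshold per predicate class and moves it toward the mean confidence observed for that class, using a class-specific rate. It is a minimal sketch under stated assumptions, not the paper's CATM update rule: the function names `update_thresholds` and `pseudo_label`, the use of the batch mean as the EMA target, and the example predicates and rates are all hypothetical.

```python
from collections import defaultdict


def update_thresholds(thresholds, momenta, batch):
    """EMA-update each class threshold toward the mean confidence seen for it.

    thresholds: dict mapping predicate class -> current threshold in [0, 1]
    momenta:    dict mapping predicate class -> EMA rate in [0, 1]
                (class-specific; a hypothetical stand-in for CATM's momentum)
    batch:      iterable of (predicted_class, confidence) pairs
    """
    by_class = defaultdict(list)
    for cls, conf in batch:
        by_class[cls].append(conf)

    updated = dict(thresholds)  # classes absent from the batch keep their value
    for cls, confs in by_class.items():
        mean_conf = sum(confs) / len(confs)
        lam = momenta[cls]
        # Standard EMA: blend the old threshold with the new observation.
        updated[cls] = (1.0 - lam) * thresholds[cls] + lam * mean_conf
    return updated


def pseudo_label(batch, thresholds):
    """Keep only predictions whose confidence clears their own class threshold."""
    return [(cls, conf) for cls, conf in batch if conf >= thresholds[cls]]


if __name__ == "__main__":
    thresholds = {"on": 0.9, "parked on": 0.5}
    momenta = {"on": 0.1, "parked on": 0.5}  # assumed values for illustration
    batch = [("on", 0.8), ("on", 0.6), ("parked on", 0.7)]

    thresholds = update_thresholds(thresholds, momenta, batch)
    print(thresholds)
    print(pseudo_label(batch, thresholds))
```

Because each class has its own threshold, a fine-grained predicate such as "parked on" can be pseudo-labeled at a confidence level that would be rejected for a head predicate such as "on", which is the balancing effect the summary attributes to class-specific thresholds.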

Abstract

Scene graph generation (SGG) models suffer from problems inherent in the benchmark datasets, such as the long-tailed predicate distribution and missing annotations. In this work, we aim to alleviate the long-tailed problem of SGG by utilizing unannotated triplets. To this end, we introduce a Self-Training framework for SGG (ST-SGG) that assigns pseudo-labels to unannotated triplets, on which the SGG models are then trained. While there has been significant progress in self-training for image recognition, designing a self-training framework for the SGG task is more challenging because of properties inherent to the task, such as the semantic ambiguity and the long-tailed distribution of predicate classes. Hence, we propose a novel pseudo-labeling technique for SGG, called Class-specific Adaptive Thresholding with Momentum (CATM), which is a model-agnostic framework that can be applied to any existing SGG model. Furthermore, we devise a graph structure learner (GSL) that is beneficial when applying our proposed self-training framework to state-of-the-art message-passing neural network (MPNN)-based SGG models. Our extensive experiments verify the effectiveness of ST-SGG on various SGG models, particularly in enhancing the performance on fine-grained predicate classes.

Paper Structure

This paper contains 43 sections, 3 equations, 14 figures, 8 tables, 2 algorithms.

Figures (14)

  • Figure 1: (a) Problems in VG scene graph dataset, and (b) self-training framework for SGG.
  • Figure 2: (a) Performance of Motif and its self-trained models with various thresholding techniques. (b) The number of pseudo-labels per predicate class when Motif-$\tau_c^{\text{cls}}$ is trained on VG.
  • Figure 3: Impact of applying GSL on the model confidence of BGNN (Li et al., 2021).
  • Figure 4: (a) Performance comparison per class. The black line indicates the number of pseudo-labeled instances. (b) Performance comparison on head, body, and tail predicate classes.
  • Figure 5: (a) Adaptive threshold values over iterations, and (b) examples of pseudo-labels of ST-SGG and IE-Trans assigned to the different cases. Conf. denotes the confidence, i.e., $\hat{q}$.
  • ...and 9 more figures