Compositional Feature Augmentation for Unbiased Scene Graph Generation

Lin Li; Guikun Chen; Jun Xiao; Yi Yang; Chunping Wang; Long Chen

TL;DR

This paper tackles bias in Scene Graph Generation caused by long-tailed predicate distributions. It introduces Compositional Feature Augmentation (CFA), a model-agnostic framework that increases tail-predicate feature diversity through two augmentations: intrinsic-CFA (replacing entity features guided by cluster-based similarity) and extrinsic-CFA (mixing in context via mixup with context triplets). CFA uses a feature bank of tail-triplet representations and a hierarchical clustering scheme to enable plausible feature substitutions, along with a contrastive objective to maintain discriminability after augmentation. Extensive experiments on Visual Genome and GQA show CFA improves mean recall and tail performance while preserving head accuracy, across Motifs, VCTree, and Transformer backbones, establishing a new state-of-the-art trade-off. The method offers a practical, generalizable path to unbiased SGG with strong implications for downstream tasks relying on robust scene graphs.
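To make the metric mentioned above concrete, here is a minimal sketch of why mean recall is more sensitive to tail predicates than plain recall. It is a simplification: the function name `recall_and_mean_recall` and the toy data are hypothetical, and the top-K ranking and per-image protocol used in actual SGG benchmarks are omitted.

```python
from collections import defaultdict

def recall_and_mean_recall(gt_predicates, hits):
    """gt_predicates[i] is the predicate label of ground-truth triplet i;
    hits[i] is True if the model recovered that triplet."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for pred, hit in zip(gt_predicates, hits):
        total[pred] += 1
        correct[pred] += int(hit)
    # Plain recall: every triplet counts equally, so head predicates dominate.
    recall = sum(correct.values()) / len(gt_predicates)
    # Mean recall: every predicate class counts equally.
    per_pred = {p: correct[p] / total[p] for p in total}
    mean_recall = sum(per_pred.values()) / len(per_pred)
    return recall, mean_recall, per_pred

# Toy data (made up): 8 head triplets ("on"), 2 tail triplets ("laying on").
gt = ["on"] * 8 + ["laying on"] * 2
hits = [True] * 8 + [False] * 2  # the model only ever gets the head class right
recall, mean_recall, per_pred = recall_and_mean_recall(gt, hits)
```

In this toy case a model that ignores the tail predicate entirely still reaches a recall of 0.8, while its mean recall drops to 0.5, which is the gap that unbiased SGG methods aim to close.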

Abstract

Scene Graph Generation (SGG) aims to detect all the visual relation triplets <sub, pred, obj> in a given image. With the emergence of various advanced techniques for better utilizing both the intrinsic and extrinsic information in each relation triplet, SGG has achieved great progress in recent years. However, due to the ubiquitous long-tailed predicate distributions, today's SGG models are still easily biased toward the head predicates. Currently, the most prevalent debiasing solutions for SGG are re-balancing methods, e.g., changing the distributions of original training samples. In this paper, we argue that all existing re-balancing strategies fail to increase the diversity of the relation triplet features of each predicate, which is critical for robust SGG. To this end, we propose a novel Compositional Feature Augmentation (CFA) strategy, which is the first unbiased SGG work to mitigate the bias issue from the perspective of increasing the diversity of triplet features. Specifically, we first decompose each relation triplet feature into two components: intrinsic feature and extrinsic feature, which correspond to the intrinsic characteristics and extrinsic contexts of a relation triplet, respectively. Then, we design two different feature augmentation modules to enrich the feature diversity of original relation triplets by replacing or mixing up either their intrinsic or extrinsic features from other samples. Due to its model-agnostic nature, CFA can be seamlessly incorporated into various SGG frameworks. Extensive ablations have shown that CFA achieves a new state-of-the-art performance on the trade-off between different metrics.
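The two augmentation ideas described in the abstract, replacing and mixing up features, can be sketched on toy vectors as follows. This is an illustrative sketch, not the paper's implementation: the dictionary layout, the `similar` lookup table, the `entity_bank`, the function names, and the fixed mixing weight `lam` are all assumptions introduced here, and label handling for the augmented samples is left out.

```python
import random

def intrinsic_replace(triplet, entity_bank, similar, rng):
    """Return a copy of `triplet` whose subject feature is swapped for a
    feature of a similar entity category (hypothetical lookup tables)."""
    candidates = similar.get(triplet["sub_cat"], [])
    if not candidates:
        return dict(triplet)  # nothing plausible to swap in
    new_cat = rng.choice(candidates)
    new_feat = rng.choice(entity_bank[new_cat])
    out = dict(triplet)  # shallow copy; the query triplet stays unchanged
    out["sub_cat"] = new_cat
    out["sub_feat"] = list(new_feat)
    return out

def extrinsic_mixup(tail_feat, context_feat, lam):
    """Blend a tail-predicate triplet feature into a context triplet feature
    with a mixup-style convex combination."""
    return [lam * t + (1.0 - lam) * c for t, c in zip(tail_feat, context_feat)]

rng = random.Random(0)
entity_bank = {"cat": [[0.9, 0.1], [0.8, 0.2]]}  # made-up entity features
similar = {"dog": ["cat"]}                        # assumed similarity table
query = {"sub_cat": "dog", "sub_feat": [1.0, 0.0],
         "pred": "laying on", "obj_cat": "bed"}

augmented = intrinsic_replace(query, entity_bank, similar, rng)
mixed = extrinsic_mixup([1.0, 0.0], [0.0, 1.0], lam=0.5)
```

Here the predicate `laying on` is preserved while the subject changes from `dog` to `cat`, mirroring the example in the Figure 2 caption; the mixup call shows only the arithmetic of the blend, not how context triplets are selected.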

Paper Structure (13 sections, 13 equations, 7 figures, 4 tables)

This paper contains 13 sections, 13 equations, 7 figures, 4 tables.

Introduction
Related Work
Approach
Revisiting the Two-Stage SGG Baselines
CFA: Compositional Feature Augmentation
Intrinsic-CFA
Extrinsic-CFA
Training Objectives
Experiments
Experimental Settings and Details
Comparison with State-of-the-Arts
Ablation Studies
Conclusion and Future Work

Figures (7)

Figure 1: (a) The intrinsic and extrinsic information for SGG. The entity prediction is for the green box, and the predicate prediction is for the relation between the red and green boxes. (b) Illustration of the diversity of feature space and decision boundary between on and laying on before and after using re-balancing and CFA. Each sample denotes the corresponding visual triplet features.
Figure 2: (a) Intrinsic-CFA: Replacing the entity feature of tail predicate triplet dog-laying on-bed from dog to cat to enhance the intrinsic feature. (b) Extrinsic-CFA: Mixing up the feature of tail predicate triplet pillow-laying on-bed into the context of pillow-on-bed to enhance the extrinsic feature.
Figure 3: The illustration of unbiased SGG framework with CFA.
Figure 4: The pipeline of Intrinsic-CFA (a) and Extrinsic-CFA (b). The blue and green boxes represent operations on the query triplet features in Intrinsic-CFA and context triplet features in Extrinsic-CFA, respectively.
Figure 5: Illustration of the pattern and context similarity between entity categories cat and dog. (a) White boxes are the behavior patterns common to the two categories. (b) Blue and red circles are entity categories that co-occur with cat and dog, respectively.
...and 2 more figures
