Ensemble Predicate Decoding for Unbiased Scene Graph Generation

Jiasong Feng, Lichun Wang, Hongbo Xu, Kai Xu, Baocai Yin

TL;DR

This work tackles predicate bias in scene graph generation caused by long-tail distributions and semantic overlap among predicates. It introduces Ensemble Predicate Decoding, a model-agnostic approach that uses a main decoder plus two auxiliary decoders trained on distinct predicate-frequency subsets to expand discriminative capacity, especially for infrequent and semantically similar predicates. By ensembling predictions from multiple decoders and using a carefully balanced loss with data partitions, the method significantly improves tail predicate accuracy (mR@K) with minimal sacrifice to head predicates and preserves overall performance across VG benchmarks. The approach demonstrates robust gains across baselines like Motifs and VCTree, and is supported by extensive ablations and hyperparameter analyses that validate the design choices and practical impact for unbiased SGG.

Abstract

Scene Graph Generation (SGG) aims to generate a comprehensive graphical representation that accurately captures the semantic information of a given scene. However, an SGG model's ability to predict fine-grained predicates is hindered by significant predicate bias. Existing works attribute the biased scene graphs to the long-tail distribution of predicates in the training data. In addition, the semantic overlap between predicate categories makes predicate prediction difficult, and the large difference in sample size among semantically similar predicates makes it harder still. Both factors place higher requirements on the discriminative ability of the model. To address this problem, this paper proposes Ensemble Predicate Decoding (EPD), which employs multiple decoders to attain unbiased scene graph generation. Two auxiliary decoders trained on lower-frequency predicates are used to improve the discriminative ability of the model. Extensive experiments on the Visual Genome (VG) dataset show that EPD enhances the model's representation capability for predicates. In addition, our approach retains better predictive capability for more frequent predicates than previous unbiased SGG methods.
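
To make the ensemble idea concrete, the following is a minimal sketch of how scores from a main decoder and two auxiliary decoders might be combined for one subject-object pair. The combination rule (a weighted average of per-decoder softmax probabilities), the function names, and all logit values are assumptions for illustration; the abstract does not specify how EPD integrates the decoder outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_scores(decoder_logits, weights=None):
    """Combine per-decoder predicate logits into one score vector.

    Assumption: a weighted average of softmax probabilities. This is a
    stand-in for whatever integration rule the paper actually uses.
    """
    if weights is None:
        weights = [1.0] * len(decoder_logits)
    probs = [softmax(logits) for logits in decoder_logits]
    total_weight = sum(weights)
    num_classes = len(probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probs)) / total_weight
        for i in range(num_classes)
    ]


def argmax_label(scores, labels):
    """Return the label with the highest score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index]


# Hypothetical logits for one pair; the values are made up, not from the paper.
PREDICATES = ["on", "standing on", "sitting on"]
main_logits = [2.0, 1.0, 0.5]   # main decoder leans toward the coarse "on"
aux1_logits = [0.5, 2.5, 0.5]   # auxiliary decoder 1 favours the finer predicate
aux2_logits = [0.2, 2.0, 1.0]   # auxiliary decoder 2 does as well

main_only = argmax_label(ensemble_scores([main_logits]), PREDICATES)
scores = ensemble_scores([main_logits, aux1_logits, aux2_logits])
combined = argmax_label(scores, PREDICATES)
print(main_only, "->", combined)
```

With these made-up numbers the main decoder alone prefers the coarse predicate, while the three-decoder average prefers the fine-grained one. Averaging probabilities rather than raw logits keeps each decoder's contribution on the same scale, which is one reasonable design choice among several.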

Paper Structure (16 sections, 9 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Scores of the top-3 predictions for the triplet $\langle \textit{bird}, \textit{--}, \textit{banana} \rangle$ across different SGG models; the ground-truth predicate is standing on. Motifs is affected by bias, so the coarse predicate on receives the highest score. The proposed Ensemble Predicate Decoding (EPD) predicts the triplet as $\langle \textit{bird}, \textit{standing on}, \textit{banana} \rangle$. EPD includes three decoders: the main decoder $\mathcal{MD}$ and the auxiliary decoders $\mathcal{AD}_1$ and $\mathcal{AD}_2$.
  • Figure 2: Overview of an SGG model using EPD. In the ensemble predicate decoding stage, the predicate encoding features are processed by multiple decoders. One decoder acts as the main decoder, and the others act as auxiliary decoders. The decoder branches partially share parameters and generate the predicate decoding features $p'_{md}$, $p'_{ad_1}$, $p'_{ad_2}$. A shared predicate classifier then scores each decoder's output, and the per-decoder predictions are integrated to obtain the final result.
  • Figure 3: The predicate distribution in the training subsets ${N_1}$, ${N_2}$ and ${N_3}$ obtained by the average partitioning method, i.e., 16:17:17 (head:body:tail). The predicate samples used for statistics are from the VG dataset.
  • Figure 4: Visualization of collaborative scoring with multiple decoders, using Motifs with EPD. The head predicate triplet $\langle \textit{man}, \textit{has}, \textit{hair} \rangle$ and the body predicate triplet $\langle \textit{dog}, \textit{walking on}, \textit{beach} \rangle$ are taken as examples. For each triplet, the scores given by EPD using different decoder combinations are shown, including: main decoder only; main decoder and auxiliary decoder 1; main decoder, auxiliary decoder 1, and auxiliary decoder 2.
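
The 16:17:17 head:body:tail split mentioned in Figure 3 can be sketched as a simple frequency-ranked partition. This is a minimal illustration under stated assumptions: predicates are ranked by descending training count and cut into consecutive groups, the helper name `partition_by_frequency` is hypothetical, and the counts are synthetic placeholders rather than VG statistics. How the training subsets ${N_1}$, ${N_2}$ and ${N_3}$ are then assembled from these groups is not described in the captions and is not shown here.

```python
def partition_by_frequency(counts, sizes=(16, 17, 17)):
    """Split predicate names into consecutive frequency-ranked groups.

    counts: dict mapping predicate name -> number of training samples.
    sizes:  group sizes from most to least frequent (head, body, tail).
    Assumption: ranking is by descending count, ties broken by name.
    """
    if sum(sizes) != len(counts):
        raise ValueError("group sizes must cover every predicate exactly once")
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    groups = []
    start = 0
    for size in sizes:
        groups.append(ranked[start:start + size])
        start += size
    return tuple(groups)


# Synthetic long-tailed counts for 50 placeholder predicate names.
counts = {f"pred_{i:02d}": 10000 // (i + 1) for i in range(50)}

head, body, tail = partition_by_frequency(counts)
print(len(head), len(body), len(tail))
print(head[0], body[0], tail[-1])
```

Replacing `counts` with real per-predicate sample counts would yield the head, body, and tail groups for an actual dataset; the tie-breaking rule only matters when two predicates have identical counts.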