Unbiased Scene Graph Generation from Biased Training

Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, Hanwang Zhang

TL;DR

The paper tackles severe training bias in scene graph generation by introducing a causal-inference framework that uses counterfactual reasoning and the Total Direct Effect ($TDE$) to yield unbiased predicate predictions. It formalizes a general SGG causal graph, defines interventions and counterfactuals, and derives an unbiased prediction $y^{\dagger}_{e}$ that subtracts context-induced bias. Through extensive experiments on Visual Genome across multiple SGG architectures, it demonstrates that $TDE$-based predictions improve bias-sensitive metrics (e.g., RR, mR@K, ZSRR) and enhance downstream graph coherence, supported by a new Scene Graph Diagnosis toolkit. The approach is model-agnostic and sheds light on separating useful context priors from harmful long-tailed biases, which can improve tasks like VQA that rely on richer scene structures.

Abstract

Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse "human walk on / sit on / lay on beach" into "human on beach". Given such SGG, the down-stream tasks such as VQA can hardly infer better scene structures than merely a bag of objects. However, debiasing in SGG is not trivial because traditional debiasing methods cannot distinguish between the good and bad bias, e.g., good context prior (e.g., "person read book" rather than "eat") and bad long-tailed bias (e.g., "near" dominating "behind / in front of"). In this paper, we present a novel SGG framework based on causal inference but not the conventional likelihood. We first build a causal graph for SGG, and perform traditional biased training with the graph. Then, we propose to draw the counterfactual causality from the trained graph to infer the effect from the bad bias, which should be removed. In particular, we use Total Direct Effect (TDE) as the proposed final predicate score for unbiased SGG. Note that our framework is agnostic to any SGG model and thus can be widely applied in the community who seeks unbiased predictions. By using the proposed Scene Graph Diagnosis toolkit on the SGG benchmark Visual Genome and several prevailing models, we observed significant improvements over the previous state-of-the-art methods.
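
The subtraction described in the abstract can be illustrated with a toy example. This is a minimal sketch and not the authors' implementation: it assumes a simple additive fusion of a visual branch and a context branch, and it assumes the counterfactual "wiped-out" visual input is an all-zero feature vector. The names `predicate_logits`, `tde_logits`, `w_visual`, and `context_logits`, and all numbers, are hypothetical.

```python
import numpy as np

PREDICATES = ["on", "near", "sitting on"]

def predicate_logits(visual_feat, context_logits, w_visual):
    """Toy additive fusion (an assumption): visual evidence plus context prior."""
    return w_visual @ visual_feat + context_logits

def tde_logits(visual_feat, context_logits, w_visual, wiped_feat=None):
    """Factual logits minus counterfactual logits with the visual input wiped out."""
    if wiped_feat is None:
        # Assumption: the wiped-out visual feature is a zero vector.
        wiped_feat = np.zeros_like(visual_feat)
    factual = predicate_logits(visual_feat, context_logits, w_visual)
    counterfactual = predicate_logits(wiped_feat, context_logits, w_visual)
    return factual - counterfactual

# Hypothetical weights and inputs, chosen so the context prior favours "on".
w_visual = np.array([[0.5, 0.1],
                     [0.1, 0.2],
                     [0.2, 1.5]])
visual_feat = np.array([1.0, 1.0])
context_logits = np.array([3.0, 1.0, 0.5])

biased = predicate_logits(visual_feat, context_logits, w_visual)
unbiased = tde_logits(visual_feat, context_logits, w_visual)

biased_label = PREDICATES[int(np.argmax(biased))]
unbiased_label = PREDICATES[int(np.argmax(unbiased))]
print(biased_label, "->", unbiased_label)
```

In this toy setting the likelihood-style prediction follows the frequent predicate, while the difference score keeps only the part of the logits that depends on the visual input. With a non-additive fusion the context term would not cancel exactly, so the sketch should be read as an illustration of the idea only.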

Paper Structure

This paper contains 23 sections, 15 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: An example of scene graph generation (SGG). (a) An input image with bounding boxes. (b) The distribution of sample fraction for the most frequent 20 predicates in Visual Genome (Krishna et al., 2017). (c) SGG from re-implemented MOTIFS (Zellers et al., 2018). (d) SGG by the proposed unbiased prediction from the same model.
  • Figure 2: (a) The biased generation that directly predicts labels from likelihood. (b) An intuitive example of the proposed total direct effect, which calculates the difference between the real scene and the counterfactual one. Note that the "wipe-out" is only for the illustrative purpose but not considered as visual processing.
  • Figure 3: (a) The example of total direct effect calculation and corresponding operations on the causal graph, where $\bar{X}$ represents wiped-out $X$. (b) Recall@100 of Predicate Classification for selected predicates ranked by sampling fraction. The biased generation refers to re-implemented MOTIFS (Zellers et al., 2018) and the proposed unbiased generation is the result from the same model using TDE.
  • Figure 4: (a) The framework used in our biased training. (b) The causal graph of the SGG framework. (c) An illustration of the proposed TDE inference.
  • Figure 5: The original causal graph of SGG together with two interventional and counterfactual alternates.
  • ...and 11 more figures
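
Figure 3(b) reports recall per predicate, and the TL;DR lists mR@K among the bias-sensitive metrics. The sketch below shows why averaging recall over predicate classes exposes long-tailed bias that overall recall hides. It is a simplified illustration, not the Scene Graph Diagnosis toolkit: it pools all ground-truth triplets instead of evaluating per image, and it assumes the top-K matching step has already produced a boolean hit flag per triplet. The function name `recall_and_mean_recall` and the data are hypothetical.

```python
from collections import defaultdict

def recall_and_mean_recall(gt_predicates, hit_flags):
    """Return (overall recall, mean per-predicate recall, per-predicate recall)."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for pred, hit in zip(gt_predicates, hit_flags):
        total[pred] += 1
        hits[pred] += int(hit)
    # Overall recall weights every triplet equally, so frequent predicates dominate.
    overall = sum(hits.values()) / sum(total.values())
    # Mean recall weights every predicate class equally.
    per_class = {p: hits[p] / total[p] for p in total}
    mean_recall = sum(per_class.values()) / len(per_class)
    return overall, mean_recall, per_class

# Hypothetical outcome: a model that always answers "on".
gt = ["on"] * 8 + ["sitting on"] * 2
flags = [True] * 8 + [False] * 2
overall, mean_recall, per_class = recall_and_mean_recall(gt, flags)
print(overall, mean_recall)
```

A model that collapses every relation to the head predicate still scores well on overall recall in this example, while the class-averaged score drops because the tail predicate is never recovered.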