Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement

Yuxuan Wang; Xiaoyuan Liu

TL;DR

This work tackles underrepresentation and predicate bias in Scene Graph Generation (SGG) by leveraging pretrained Vision-Language Models (VLMs). It introduces LM Estimation, a constrained-optimization approach that approximates the pretraining predicate distribution $\pi_{pt}$, which cannot be observed directly. The estimate enables the post-hoc logits adjustment $\hat{o}^k(r)=o^k(r)-\log P_{tr}(r)+\log P_{ta}(r)$, where $P_{tr}$ and $P_{ta}$ are the training and target predicate distributions, to debias predictions from both zero-shot VLMs and SGG heads. A certainty-aware ensemble then combines the debiased zero-shot predictions with task-specific SGG predictions on a per-sample basis, using sample confidences to assign dynamic weights, all without additional training. The method significantly improves mean Recall and Recall on Visual Genome, particularly for unseen tail relations, by transferring pretrained knowledge while mitigating language bias. Overall, training-free LM Estimation plus dynamic ensembling offers a practical way to enhance SGG representations with pretrained VLMs.
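The logits adjustment above can be illustrated with a small sketch. It reads $P_{tr}$ as the prior the model was trained under and $P_{ta}$ as the desired target prior; the function names, the toy three-class numbers, and the uniform target are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adjust_logits(logits, train_prior, target_prior, eps=1e-12):
    """Apply o_hat(r) = o(r) - log P_tr(r) + log P_ta(r) per relation class.

    `eps` guards against log(0) for classes with zero prior mass
    (an illustrative choice, not specified in the source).
    """
    return [
        o - math.log(p_tr + eps) + math.log(p_ta + eps)
        for o, p_tr, p_ta in zip(logits, train_prior, target_prior)
    ]


# Toy example (hypothetical numbers): three relation classes, head-heavy prior.
logits = [2.0, 1.5, 0.5]
train_prior = [0.7, 0.2, 0.1]      # imbalanced prior the model absorbed
target_prior = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform target

debiased = adjust_logits(logits, train_prior, target_prior)
probs_before = softmax(logits)
probs_after = softmax(debiased)

pred_before = probs_before.index(max(probs_before))
pred_after = probs_after.index(max(probs_after))
```

In this toy case the raw prediction favors the head class, while the adjusted logits shift the prediction toward a less frequent class, which is the intended effect of removing the training prior.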

Abstract

Scene Graph Generation (SGG) provides a basic language representation of visual scenes, requiring models to grasp complex and diverse semantics between objects. This complexity and diversity lead to underrepresentation, where some triplet labels are rare or even unseen during training, resulting in imprecise predictions. To tackle this, we propose integrating pretrained Vision-Language Models (VLMs) to enhance representation. However, due to the gap between pretraining and SGG, direct inference with pretrained VLMs on SGG leads to severe bias, which stems from the imbalanced predicate distribution in the pretraining language set. To alleviate the bias, we introduce a novel LM Estimation to approximate this unattainable predicate distribution. Finally, we ensemble the debiased VLMs with SGG models to enhance the representation, designing a certainty-aware indicator that scores each sample and dynamically adjusts the ensemble weights. Our training-free method effectively addresses the predicate bias in pretrained VLMs, enhances SGG's representation, and significantly improves performance.
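A per-sample, confidence-weighted ensemble of the kind described here might look like the following sketch. The abstract does not define the certainty indicator, so this example assumes the maximum softmax probability as the confidence score and normalizes the two scores into mixing weights; the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def certainty(probs):
    """Assumed certainty score: the maximum class probability."""
    return max(probs)


def certainty_aware_ensemble(zs_logits, sg_logits):
    """Mix zero-shot and task-specific predictions for one sample.

    The weight on the zero-shot branch grows with its relative
    certainty (an assumed weighting rule, not the paper's definition).
    """
    p_zs = softmax(zs_logits)
    p_sg = softmax(sg_logits)
    c_zs, c_sg = certainty(p_zs), certainty(p_sg)
    w_zs = c_zs / (c_zs + c_sg)
    mixed = [w_zs * a + (1.0 - w_zs) * b for a, b in zip(p_zs, p_sg)]
    return mixed, w_zs


# Toy sample: the zero-shot branch is confident, the SGG branch is not.
mixed, w_zs = certainty_aware_ensemble([4.0, 0.0, 0.0], [0.1, 0.0, 0.2])
prediction = mixed.index(max(mixed))
```

Because the weights are computed per sample, a sample that the task-specific model represents poorly leans more on the zero-shot branch, and no additional training is involved.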

Paper Structure (21 sections, 20 equations, 3 figures, 5 tables)

Introduction
Related Work
Methodology
Setup
Method Overview
Predicate Debiasing
Certainty-aware Ensemble
Summary
Experiment
Experiment Settings
Efficacy Analysis
Estimated Distribution Analysis
Ablation Study
Conclusion
Limitation
...and 6 more sections

Figures (3)

Figure 1: Illustration of the underrepresentation issue in Visual Genome. We highlight the relation class "carrying" from the top-right imbalanced class distribution. We present various samples with their training representation levels and confidence scores for the ground truth class, where lower scores indicate poorer prediction quality. We find that samples less represented by the training set tend to have lower-quality predictions.
Figure 2: Illustration of our proposed architecture. Left: the visual-language inputs processed from image regions $\mathbf{x}_{i,j}$ and object labels $(z_i, z_j)$, either provided or predicted by a Faster R-CNN detector. Middle: the fixed zero-shot VLM $f_\text{zs}$ and the trainable task-specific model $f_\text{sg}$, for which we use a fine-tuned VLM as an example. Right: the relation label debiasing process and the certainty-aware ensemble.
Figure 3: The relation label distributions on Visual Genome. The upper figure illustrates the distribution across all classes, while the lower one shows the probability distribution on some typical categories. Train Set: the class distribution $\mathbf{\pi_\text{sg}}$ in the training set. ViLT and Oscar: the estimated distribution $\mathbf{\pi_\text{pt}}$ obtained with LM Estimation for the two models' pretraining stages.
