BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation

Peng Hao, Weilong Wang, Xiaobing Wang, Yingying Jiang, Hanchao Jia, Shaowei Cui, Junhang Wei, Xiaoshuai Hao

TL;DR

BCTR tackles Scene Graph Generation by introducing a bidirectional conditioning factorization that enables mutual refinement between entity and predicate predictions within a semantic-aligned feature space. The Bidirectional Conditioning Generator (BCG) achieves this through internal bidirectional attention and iterative refinement, while Random Feature Alignment (RFA) regularizes features by distilling from Vision-Language Pre-trained Models and initializing CLIP-based classifiers. The combination yields improved generalization to unseen but related relationships and achieves state-of-the-art results on Visual Genome and Open Images V6, with strong gains in tail-category performance and zero-shot triplets. These results suggest that learning interaction patterns in a semantically aligned space, reinforced by multimodal distillation, enhances both accuracy and robustness for open-set relational reasoning in SGG. The approach advances practical scene understanding by delivering more reliable and balanced relational predictions for complex images.

Abstract

Scene Graph Generation (SGG) remains a challenging task due to its compositional property. Previous approaches improve prediction efficiency through end-to-end learning. However, these methods exhibit limited performance because they assume unidirectional conditioning between entities and predicates, which restricts effective information interaction. To address this limitation, we propose a novel bidirectional conditioning factorization in a semantic-aligned space for SGG, enabling efficient and generalizable interaction between entities and predicates. Specifically, we introduce an end-to-end scene graph generation model, the Bidirectional Conditioning Transformer (BCTR), to implement this factorization. BCTR consists of two key modules. First, the Bidirectional Conditioning Generator (BCG) performs multi-stage interactive feature augmentation between entities and predicates, enabling mutual enhancement between these predictions. Second, Random Feature Alignment (RFA) is presented to regularize the feature space by distilling multi-modal knowledge from pre-trained models. Within this regularized feature space, BCG can capture interaction patterns across diverse relationships during training, and the learned interaction patterns can generalize to unseen but semantically related relationships during inference. Extensive experiments on Visual Genome and Open Images V6 show that BCTR achieves state-of-the-art performance on both benchmarks.
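
To make the idea of bidirectional conditioning concrete, the following is a minimal sketch, not the authors' implementation. It assumes entity and predicate queries are plain feature matrices and replaces the BCG decoders with a single-head, projection-free cross-attention step; the names `cross_attend` and `bidirectional_refine`, the number of stages, and all shapes are hypothetical choices made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    # Single-head scaled dot-product attention with a residual connection.
    # Assumption: no learned projections, purely for illustration.
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return queries + weights @ context

def bidirectional_refine(entities, predicates, num_stages=3):
    # Each stage conditions predicates on entities, then entities on the
    # freshly updated predicates, so information flows in both directions.
    for _ in range(num_stages):
        predicates = cross_attend(predicates, entities)
        entities = cross_attend(entities, predicates)
    return entities, predicates

rng = np.random.default_rng(0)
entity_queries = rng.normal(size=(5, 16))     # 5 hypothetical entity queries
predicate_queries = rng.normal(size=(7, 16))  # 7 hypothetical predicate queries
refined_entities, refined_predicates = bidirectional_refine(
    entity_queries, predicate_queries
)
```

A unidirectional variant would keep only one of the two `cross_attend` calls inside the loop, so one set of queries would never receive feedback from the other.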

Paper Structure (36 sections, 16 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: The motivation and pipeline of BCTR. (a) Comparison of conditioning approaches in SGG. While bidirectional conditioning generates relationships through mutual feature augmentation even when an entity is missing, it struggles to generalize to unseen categories. Our approach learns bidirectional conditioning in a semantic-aligned space, achieving better generalization to unseen but semantically related relationships. (b) The left panel illustrates the extraction of CLIP, entity, and predicate features from the input image, while the right panel shows how BCTR enhances entity (e.g., "Bag") and predicate (e.g., "Behind") detection through bidirectional interaction in the semantic-aligned space.
  • Figure 2: Overview of the BCTR. (a) Visual and entity features are extracted from the input image. (b) The entity-aware predicate queries and entity queries are iteratively updated through the proposed BCG. (c) During training, the output features from various decoders are regularized by the RFA. Final predictions are generated from these distilled features.
  • Figure 3: Random Feature Alignment for the entity and predicate prediction. First, the decoder features are randomly distilled with CLIP features. Then, the classifier weights are initialized with vectors generated by the CLIP text encoder, which encodes the ground-truth labels. This alignment ensures that the visual features can be accurately classified.
  • Figure 4: Qualitative results of our method and another method on the VG dataset. When leveraging identical DETR-based detectors, the bidirectional interaction mechanism of BCTR reduces missed detections (highlighted as yellow nodes) and enhances performance on the SGG task.
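
The two steps described in the Figure 3 caption (random distillation against CLIP features, then classifiers initialized from CLIP text embeddings of the labels) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: random vectors stand in for real CLIP embeddings, the distillation loss is taken to be one minus cosine similarity over a random subset of features, and the helper names `random_distill_loss`, `init_classifier_from_text`, and `classify` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def random_distill_loss(student, teacher, keep_prob, rng):
    # Assumption: "random" distillation is modeled as aligning a random
    # subset of rows; the loss is mean (1 - cosine similarity).
    mask = rng.random(student.shape[0]) < keep_prob
    if not mask.any():
        return 0.0
    cos = np.sum(l2_normalize(student[mask]) * l2_normalize(teacher[mask]), axis=-1)
    return float(np.mean(1.0 - cos))

def init_classifier_from_text(text_embeddings):
    # One weight vector per class, taken from that class's text embedding.
    return l2_normalize(text_embeddings)

def classify(features, weights):
    # Logits are cosine similarities between features and class vectors.
    return l2_normalize(features) @ weights.T

rng = np.random.default_rng(0)
# Stand-ins for text embeddings of 4 class labels (random, NOT real CLIP output).
text_embeddings = rng.normal(size=(4, 32))
weights = init_classifier_from_text(text_embeddings)

# Features that already lie close to the embeddings of classes 2 and 0.
features = text_embeddings[[2, 0]] + 0.01 * rng.normal(size=(2, 32))
predicted = classify(features, weights).argmax(axis=1)

# Distillation loss shrinks as student features approach the teacher's.
teacher = rng.normal(size=(6, 32))
loss_aligned = random_distill_loss(teacher.copy(), teacher, 1.0, rng)
loss_random = random_distill_loss(rng.normal(size=(6, 32)), teacher, 1.0, rng)
```

Because the classifier weights live in the same space as the text embeddings, a feature that has been pulled toward that space during training can be scored against the embedding of any label, which is the intuition behind the generalization to semantically related relationships described above.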