BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation
Peng Hao, Weilong Wang, Xiaobing Wang, Yingying Jiang, Hanchao Jia, Shaowei Cui, Junhang Wei, Xiaoshuai Hao
TL;DR
BCTR tackles Scene Graph Generation by introducing a bidirectional conditioning factorization that enables mutual refinement between entity and predicate predictions within a semantic-aligned feature space. The Bidirectional Conditioning Generator (BCG) achieves this through internal bidirectional attention and iterative refinement, while Random Feature Alignment (RFA) regularizes features by distilling from Vision-Language Pre-trained Models and initializing CLIP-based classifiers. The combination yields improved generalization to unseen but related relationships and achieves state-of-the-art results on Visual Genome and Open Images V6, with strong gains in tail-category performance and zero-shot triplets. These results suggest that learning interaction patterns in a semantically aligned space, reinforced by multimodal distillation, enhances both accuracy and robustness for open-set relational reasoning in SGG. The approach advances practical scene understanding by delivering more reliable and balanced relational predictions for complex images.
Abstract
Scene Graph Generation (SGG) remains a challenging task because of its compositional nature. Previous approaches improve prediction efficiency through end-to-end learning. However, these methods exhibit limited performance because they assume unidirectional conditioning between entities and predicates, which restricts effective information interaction. To address this limitation, we propose a novel bidirectional conditioning factorization in a semantic-aligned space for SGG, enabling efficient and generalizable interaction between entities and predicates. Specifically, we introduce an end-to-end scene graph generation model, the Bidirectional Conditioning Transformer (BCTR), to implement this factorization. BCTR consists of two key modules. First, the Bidirectional Conditioning Generator (BCG) performs multi-stage interactive feature augmentation between entities and predicates, enabling mutual enhancement between the two sets of predictions. Second, Random Feature Alignment (RFA) is introduced to regularize the feature space by distilling multi-modal knowledge from pre-trained models. Within this regularized feature space, BCG can capture interaction patterns across diverse relationships during training, and the learned interaction patterns can generalize to unseen but semantically related relationships during inference. Extensive experiments on Visual Genome and Open Images V6 show that BCTR achieves state-of-the-art performance on both benchmarks.
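The idea of multi-stage mutual refinement between entity and predicate features can be illustrated with a minimal sketch. This is not the BCG implementation: it assumes single-head scaled dot-product attention with no learned projections, residual updates, and a fixed number of stages, and the names `cross_attend`, `bidirectional_refine`, and `num_stages` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head scaled dot-product attention (assumption: no learned weights)."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context


def bidirectional_refine(entities, predicates, num_stages=3):
    """Refine both feature sets, each conditioned on the other, over several stages."""
    for _ in range(num_stages):
        # Both updates read the features from the previous stage, so
        # information flows entity -> predicate and predicate -> entity.
        new_entities = entities + cross_attend(entities, predicates)
        new_predicates = predicates + cross_attend(predicates, entities)
        entities, predicates = new_entities, new_predicates
    return entities, predicates


# Toy inputs: 5 entity queries and 4 predicate queries in a shared 16-d space.
rng = np.random.default_rng(0)
ent = rng.normal(size=(5, 16))
pred = rng.normal(size=(4, 16))
ent_out, pred_out = bidirectional_refine(ent, pred)
```

A unidirectional variant would update only one of the two feature sets in the loop. The sketch updates both in every stage, which is the contrast with unidirectional conditioning that the abstract draws.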
