Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Tim Salzmann, Markus Ryll, Alex Bewley, Matthias Minderer
TL;DR
This work tackles open-vocabulary visual relationship detection (VRD) by moving away from decoder-centric VRD models and toward an encoder-only Scene-Graph ViT that learns object relationships directly within the image backbone. A key component is the Relationship Attention mechanism, which hard-selects a small set of high-confidence <subject-object> pairs via scores $p_{ij} = s_i o_j^T$, where $s_i$ and $o_j$ are the subject and object embeddings of tokens $i$ and $j$. Relationship embeddings are computed as $r_{ij} = \text{LayerNorm}(s_i + o_j)$ and classified against disentangled text embeddings for objects and predicates. The model is trained end-to-end on a mixture of object and relationship datasets, using bipartite matching and four losses that include $L_1$ and gIoU for boxes, sigmoid cross-entropy for object/predicate classification, and a relationship-score loss. Empirically, Scene-Graph ViT achieves state-of-the-art VRD performance on Visual Genome (graph-constrained) and large-vocabulary GQA with real-time inference, while maintaining good object-detection performance and offering flexible, end-to-end training across diverse data. Challenges remain in HOI tasks and in zero-shot generalization to unseen classes.
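The pair-selection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `layer_norm` and `relationship_attention`, the parameter `k`, and the use of a LayerNorm without learned scale and bias are assumptions made for illustration, and the sketch gives no special treatment to pairs with $i = j$.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature axis; learned scale/bias are omitted (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relationship_attention(s, o, k):
    """Hard-select the k highest-scoring <subject-object> pairs.

    s, o: arrays of shape (n, d) holding subject and object embeddings.
    Returns subject indices, object indices, pair scores, and the
    relationship embeddings r_ij = LayerNorm(s_i + o_j) for the selected pairs.
    """
    p = s @ o.T                                # p_ij = s_i o_j^T, shape (n, n)
    flat = np.argsort(p, axis=None)[::-1][:k]  # flat indices of the top-k scores
    i, j = np.unravel_index(flat, p.shape)     # back to (subject, object) indices
    r = layer_norm(s[i] + o[j])                # shape (k, d)
    return i, j, p[i, j], r

# Usage with random embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
s = rng.normal(size=(5, 8))
o = rng.normal(size=(5, 8))
i, j, scores, r = relationship_attention(s, o, k=3)
```

Scoring all $n^2$ pairs with a single matrix product and embedding only the top $k$ keeps the number of relationship embeddings passed to classification small.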
Abstract
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide ablations, real-world qualitative examples, and analyses of zero-shot performance.
