Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection

Tim Salzmann; Markus Ryll; Alex Bewley; Matthias Minderer

TL;DR

This work tackles open-vocabulary visual relationship detection (VRD) by replacing decoder-centric VRD models with an encoder-only Scene-Graph ViT that learns object relationships directly within the image backbone. A key component is the Relationship Attention mechanism, which hard-selects a small set of high-confidence <subject-object> pairs via scores $p_{ij} = s_i o_j^\top$; relationship embeddings are computed as $r_{ij} = \text{LayerNorm}(s_i + o_j)$ and classified against disentangled text embeddings for objects and predicates. The model is trained end-to-end on a mixture of object and relationship datasets, using bipartite matching and four losses that include $L_1$ and gIoU for boxes, sigmoid cross-entropy for object/predicate classification, and a relationship-score loss. Empirically, Scene-Graph ViT achieves state-of-the-art VRD performance on Visual Genome (graph-constrained) and large-vocabulary GQA with real-time inference, while maintaining good object-detection performance and offering flexible, end-to-end training across diverse data. Challenges remain in human-object interaction (HOI) tasks and zero-shot generalization to unseen classes.
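The two formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random arrays stand in for the outputs of the <subject> and <object> projections, the sizes are arbitrary, and the layer norm here has no learned scale or shift (an assumption made to keep the example short).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize the last axis to zero mean and unit variance
    # (no learned scale/shift; an assumption for this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relationship_scores(subj, obj):
    # p_ij = s_i . o_j for every <subject>/<object> pair; shape (N, N).
    return subj @ obj.T

def relationship_embedding(subj, obj, i, j):
    # r_ij = LayerNorm(s_i + o_j)
    return layer_norm(subj[i] + obj[j])

rng = np.random.default_rng(0)
num_tokens, dim = 6, 8                      # hypothetical sizes
subj = rng.normal(size=(num_tokens, dim))   # stand-in for <subject> embeddings
obj = rng.normal(size=(num_tokens, dim))    # stand-in for <object> embeddings

scores = relationship_scores(subj, obj)
# Pick the single highest-scoring pair and build its relationship embedding.
i, j = np.unravel_index(np.argmax(scores), scores.shape)
r_ij = relationship_embedding(subj, obj, i, j)
```

In the model described above, `r_ij` would then be compared against text embeddings of object or predicate descriptions; that step is omitted here.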

Abstract

Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide ablations, real-world qualitative examples, and analyses of zero-shot performance.

Paper Structure (39 sections, 7 figures, 7 tables)

Introduction
Related Work
Detector-Agnostic Relationship Detection.
End-to-End Relationship Detection.
Long-Tailed and Open-Vocabulary Relationship Detection.
Scene-Graph ViT
Encoder-Only Open-Vocabulary Object Detection
Extension to Relationship Prediction
Training
Experiments
Experimental Setup
Datasets.
Training.
Evaluation Procedure and Metrics.
Exhaustive Relationship Evaluation.
...and 24 more sections

Figures (7)

Figure 2: For relationship selection, image tokens are first projected using two lightweight MLPs to produce <subject> and <object> embeddings. A relationship score is then computed as the inner product between all <subject> and <object> embeddings. Relationships are filtered by first selecting the top object instances, using the scores along the diagonal to represent instance likelihood. Among the remaining instances, the top <subject-object> pairs are selected using the off-diagonal scores. This yields a set of relationship triplets, each consisting of a <subject> index, an <object> index, and a relationship embedding that is computed by summing the respective <subject> and <object> embeddings. For classification, the relationship embeddings are compared against text embeddings of object class or predicate text descriptions.
Figure 3: Model speed and VRD accuracy by number of selected relationships $k$. Speed is relative to a non-VRD object detector [minderer2022simple].
Figure 4: Qualitative examples of difficult edge cases from the VG150 test split [krishna2017visual, zellers2018neural]. From left to right: Ground Truth, SG-ViT (B/32), SG-ViT (L/14). In all cases the <subject> is lime and the <object> is red.
Figure 5: Qualitative examples showing SG-ViT (L/14) on out-of-distribution data from the OXE dataset [open_x_embodiment_rt_x_2023]. In all cases the <subject> is lime and the <object> is red. Note how the model correctly disambiguates several instances of the same class (e.g. "bananas" and "bottle") depending on their relationships.
Figure 6: Computing <subject> and <object> embeddings with separate MLPs is necessary for good performance.
...and 2 more figures
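
The two-stage filtering described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `select_relationships`, its parameters, and the toy score matrix are all hypothetical, and masking the diagonal in the second stage follows the caption's wording about off-diagonal scores.

```python
import numpy as np

def select_relationships(scores, num_instances, num_pairs):
    """Return (subject_idx, object_idx) arrays for the selected pairs."""
    # Stage 1: diagonal scores act as instance likelihoods; keep the top ones.
    diag = np.diag(scores)
    keep = np.sort(np.argsort(-diag)[:num_instances])
    # Stage 2: rank off-diagonal pairs among the kept instances only.
    sub = scores[np.ix_(keep, keep)].astype(float)  # copy, safe to modify
    np.fill_diagonal(sub, -np.inf)                  # exclude self-pairs
    m = len(keep)
    num_pairs = min(num_pairs, m * (m - 1))         # never pick masked entries
    flat = np.argsort(-sub, axis=None)[:num_pairs]
    rows, cols = np.unravel_index(flat, sub.shape)
    # Map positions in the reduced matrix back to original token indices.
    return keep[rows], keep[cols]

# Toy 4-token score matrix (hypothetical values).
scores = np.array([
    [5.0, 1.0, 9.0, 0.5],
    [2.0, 4.0, 3.0, 8.0],
    [7.0, 6.0, 0.5, 1.0],
    [0.0, 2.5, 3.0, 6.0],
])
subj_idx, obj_idx = select_relationships(scores, num_instances=3, num_pairs=2)
print(list(zip(subj_idx.tolist(), obj_idx.tolist())))  # [(1, 3), (3, 1)]
```

Token 2 has the lowest diagonal score, so it is dropped in the first stage; the high off-diagonal score 9.0 at position (0, 2) is therefore never considered in the second stage.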
