SGTR+: End-to-end Scene Graph Generation with Transformer

Rongjie Li; Songyang Zhang; Xuming He

TL;DR

SGTR reframes scene graph generation as end-to-end bipartite graph construction: it jointly generates entity and predicate nodes and uses a differentiable graph assembling module to form relation triplets. The base SGTR uses a DETR-like entity detector and a structured predicate decoder, while SGTR+ adds a spatially aware predicate node generator and a unified graph assembling mechanism for end-to-end optimization. The approach achieves state-of-the-art or competitive results on Visual Genome, OpenImages-V6, and GQA, with improved efficiency and robustness, particularly in long-tail and complex scenes. The work provides a scalable, end-to-end framework that uses explicit entity-predicate compositional modeling to improve visual relationship reasoning and downstream scene understanding tasks.

Abstract

Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property. Most previous works adopt either a bottom-up, two-stage approach or a point-based, one-stage approach, which often suffer from high time complexity or suboptimal designs. In this work, we propose a novel SGG method that addresses these issues by formulating the task as a bipartite graph construction problem. We create a transformer-based end-to-end framework that generates the entity and entity-aware predicate proposal sets and infers directed edges to form relation triplets. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Building on the bipartite graph assembling paradigm, we further propose a new technical design that improves the efficacy of entity-aware modeling and the optimization stability of graph assembling. Equipped with the enhanced entity-aware design, our method achieves optimal performance and time complexity. Extensive experimental results show that our design achieves state-of-the-art or comparable performance on three challenging benchmarks, surpassing most existing approaches while enjoying higher inference efficiency. Code is available at https://github.com/Scarecrow0/SGTR
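
The bipartite assembling idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `PredicateNode` class, the per-entity score lists, the product used as a triplet score, and the rule that skips self-relations are all assumptions made for illustration. It only shows how predicate nodes might be linked to entity nodes by picking the top-k entities for the subject and object slots.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PredicateNode:
    """Hypothetical predicate node: a label plus one score per entity node."""
    label: str
    subject_scores: List[float]  # assumed correspondence to each entity as subject
    object_scores: List[float]   # assumed correspondence to each entity as object


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def assemble_triplets(
    entities: List[str], predicates: List[PredicateNode], k: int = 1
) -> List[Tuple[str, str, str, float]]:
    """Link each predicate node to its top-k subject and object entities."""
    triplets = []
    for pred in predicates:
        for s in top_k_indices(pred.subject_scores, k):
            for o in top_k_indices(pred.object_scores, k):
                if s == o:
                    continue  # assumption: an entity does not relate to itself
                # Assumption: triplet confidence is the product of both scores.
                score = pred.subject_scores[s] * pred.object_scores[o]
                triplets.append((entities[s], pred.label, entities[o], score))
    # Rank triplets so the most confident relations come first.
    triplets.sort(key=lambda t: t[3], reverse=True)
    return triplets


entities = ["person", "horse", "hat"]
predicates = [
    PredicateNode("riding", [0.9, 0.1, 0.0], [0.1, 0.8, 0.1]),
    PredicateNode("wearing", [0.7, 0.2, 0.1], [0.0, 0.1, 0.9]),
]
graph = assemble_triplets(entities, predicates, k=1)
for subj, pred, obj, score in graph:
    print(f"{subj} -{pred}-> {obj} ({score:.2f})")
```

With the toy scores above, each predicate node connects to one subject and one object, giving two directed triplets ranked by confidence.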

Paper Structure (41 sections, 33 equations, 15 figures, 13 tables)

Introduction
Related Work
Preliminary
Problem Setting
Method Overview
Our Approach
Backbone Network and Entity Node Generator
Predicate Node Generator
Predicate Encoder
Predicate Query Initialization
Structural Predicate Node Decoder
Bipartite Graph Assembling
Node Correspondence Score
Relationship Triplet Generation
SGTR+
...and 26 more sections

Figures (15)

Figure 1: An overview of the SGTR pipeline for scene graph generation. We formulate SGG as a bipartite graph construction process. First, the entity and predicate nodes are generated, respectively. Then we assemble the bipartite scene graph from the two types of nodes.
Figure 2: An illustration of the overall pipeline of our SGTR+ model. Left) We use a CNN backbone together with a transformer encoder for image feature extraction. The entity and predicate node generators are introduced to produce the entity nodes and entity-aware predicate nodes. A graph assembling mechanism is developed to construct the final bipartite scene graph. Right) The predicate node generator consists of three parts: a) predicate query initialization, b) a predicate encoder, and c) a structural predicate node decoder, which is designed to generate entity-aware predicate nodes.
Figure 3: An illustration of Unified Bipartite Graph Assembling. In SGTR, the partially differentiable top-k selection constructs relation triplets that match GT scene graphs (the upper orange block). The unified GA of SGTR+ (the lower green block) adopts a weighted sum to construct relation triplets, allowing GT to optimize all predicate-entity associations.
Figure 4: An illustration of the improved predicate node generator. The left part illustrates the initialization of the spatial-aware predicate query. The right part illustrates the spatial-aware entity indicator sub-decoder.
Figure 5: The complexity comparison with SSRCNN.
...and 10 more figures
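
The contrast drawn in the Figure 3 caption, hard top-k selection versus a weighted sum over all predicate-entity associations, can be sketched as follows. This is a hedged illustration rather than the paper's formulation: the softmax normalization, the function names, and the two-dimensional toy features are assumptions. The point is only that a weighted sum lets every entity contribute to the assembled result, whereas a hard selection uses one entity and ignores the rest.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores into weights that sum to 1 (assumed normalization)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_top1(scores: List[float], features: List[List[float]]) -> List[float]:
    """Hard selection: copy the feature of the single best-scoring entity."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return features[best]


def soft_weighted_sum(scores: List[float], features: List[List[float]]) -> List[float]:
    """Soft selection: blend all entity features by their normalized scores."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Toy correspondence scores between one predicate node and three entity nodes.
scores = [2.0, 1.0, 0.0]
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

hard = hard_top1(scores, features)
soft = soft_weighted_sum(scores, features)
print("hard:", hard)
print("soft:", [round(v, 3) for v in soft])
```

In a differentiable framework, the soft variant would pass gradient to every score, while the hard variant only touches the selected entity; the pure-Python version here shows the arithmetic only.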
