
SGTR+: End-to-end Scene Graph Generation with Transformer

Rongjie Li, Songyang Zhang, Xuming He

TL;DR

SGTR reframes scene graph generation as end-to-end bipartite graph construction: it jointly generates entity and predicate nodes and uses a differentiable graph assembling module to connect them into relation triplets. The base SGTR uses a DETR-like entity detector and a structured predicate decoder, while SGTR+ adds a spatially aware predicate node generator and a unified graph assembling mechanism for end-to-end optimization. The approach achieves state-of-the-art or competitive results on Visual Genome, OpenImages-V6, and GQA, with improved efficiency and robustness, particularly in long-tail and complex scenes. The result is a scalable, end-to-end framework that uses explicit entity-predicate compositional modeling to improve visual relationship reasoning and downstream scene understanding tasks.

Abstract

Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property. Most previous works adopt either a bottom-up two-stage approach or a point-based one-stage approach, which often suffer from high time complexity or suboptimal designs. In this work, we propose a novel SGG method that addresses these issues by formulating the task as a bipartite graph construction problem. Specifically, we create a transformer-based end-to-end framework that generates the entity and entity-aware predicate proposal sets and infers directed edges to form relation triplets. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Building on the bipartite graph assembling paradigm, we further propose a new technical design that improves the efficacy of entity-aware modeling and the optimization stability of graph assembling. Equipped with the enhanced entity-aware design, our method achieves a favorable balance between performance and time complexity. Extensive experimental results show that our design achieves state-of-the-art or comparable performance on three challenging benchmarks, surpassing most existing approaches while offering more efficient inference. Code is available at https://github.com/Scarecrow0/SGTR
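
The bipartite formulation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names, the `match_score` function (label agreement times box IoU), and the top-1 selection are all assumptions made for illustration. The sketch only shows the idea that each predicate node carries a guess about its subject and object, and that assembling links it to the best-matching entity nodes to form a triplet.

```python
from dataclasses import dataclass


@dataclass
class EntityNode:
    label: str
    box: tuple  # (x1, y1, x2, y2)


@dataclass
class PredicateNode:
    label: str
    # "Entity-aware" part: the predicate's own guess about its subject and object.
    subject_label: str
    subject_box: tuple
    object_label: str
    object_box: tuple


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_score(entity, label, box):
    # Hypothetical matching function (an assumption, not taken from the paper):
    # box IoU when the class labels agree, zero otherwise.
    return box_iou(entity.box, box) if entity.label == label else 0.0


def assemble(entities, predicates):
    """Link each predicate node to its best subject and object entity node."""
    triplets = []
    for p in predicates:
        idx = range(len(entities))
        subj = max(idx, key=lambda i: match_score(entities[i], p.subject_label, p.subject_box))
        obj = max(idx, key=lambda i: match_score(entities[i], p.object_label, p.object_box))
        triplets.append((subj, p.label, obj))  # directed edges: subj -> predicate -> obj
    return triplets


entities = [
    EntityNode("person", (10, 10, 50, 100)),
    EntityNode("horse", (20, 50, 120, 140)),
]
predicates = [
    PredicateNode("riding", "person", (12, 8, 52, 98), "horse", (18, 55, 118, 138)),
]
triplets = assemble(entities, predicates)
print(triplets)
```

Each returned triplet stores indices into the entity list, so the graph stays bipartite: entity nodes on one side, predicate nodes on the other, and two directed edges per predicate.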

Paper Structure (41 sections, 33 equations, 15 figures, 13 tables)

Figures (15)

  • Figure 1: An overview of the SGTR pipeline for scene graph generation. We formulate SGG as a bipartite graph construction process. First, the entity and predicate nodes are generated separately. Then we assemble the bipartite scene graph from the two types of nodes.
  • Figure 2: An illustration of the overall pipeline of our SGTR+ model. (Left) We use a CNN backbone together with a transformer encoder for image feature extraction. The entity and predicate node generators produce the entity nodes and entity-aware predicate nodes. A graph assembling mechanism then constructs the final bipartite scene graph. (Right) The predicate node generator consists of three parts: a) predicate query initialization, b) a predicate encoder, and c) a structural predicate node decoder, which is designed to generate entity-aware predicate nodes.
  • Figure 3: Illustration of Unified Bipartite Graph Assembling. In SGTR, the partially differentiable top-k selection constructs relation triplets that match the ground-truth (GT) scene graphs (the upper orange block). The unified graph assembling (GA) of SGTR+ (the lower green block) adopts a weighted sum to construct relation triplets, allowing the GT to optimize all predicate-entity associations.
  • Figure 4: Illustration of the improved predicate node generator. The left part shows the initialization of the spatial-aware predicate query. The right part shows the spatial-aware entity indicator sub-decoder.
  • Figure 5: The complexity comparison with SSRCNN.
  • ...and 10 more figures
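
The contrast drawn in the Figure 3 caption, hard top-k selection versus a weighted sum over all predicate-entity associations, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the softmax normalization, the function names, and the toy scores and features are hypothetical, since the caption only says that SGTR+ uses a weighted sum.

```python
import math


def top_k_select(scores, k):
    """Hard selection: indices of the k highest-scoring entities.

    Entities outside the top k play no part in the assembled triplet.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def softmax(scores):
    """Numerically stable softmax (the normalization is an assumption here)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(scores, features):
    """Soft assembling: blend all entity features by their association weight.

    Every predicate-entity association contributes to the result, so a loss
    on the output can reach every score, not only the selected ones.
    """
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Toy association scores between one predicate node and three entity nodes.
scores = [2.0, 0.5, 1.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

selected = top_k_select(scores, k=2)
blended = weighted_sum(scores, features)
print(selected, blended)
```

In an autograd framework the same contrast shows up in the gradients: index selection passes gradient only through the chosen entries, while the weighted sum passes it through every association score.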