SGTR+: End-to-end Scene Graph Generation with Transformer
Rongjie Li, Songyang Zhang, Xuming He
TL;DR
SGTR reframes scene graph generation as end-to-end bipartite graph construction: it jointly generates entity and predicate nodes and uses a differentiable graph assembling module to connect them into relation triplets. The base SGTR uses a DETR-like entity detector and a structured predicate decoder, while SGTR+ adds a spatially aware predicate node generator and a unified graph assembling mechanism for end-to-end optimization. The approach achieves state-of-the-art or competitive results on Visual Genome, OpenImages-V6, and GQA, with improved efficiency and robustness, particularly in long-tail and complex scenes. This work provides a scalable, end-to-end framework that uses explicit entity-predicate compositional modeling to improve visual relationship reasoning and downstream scene understanding tasks.
Abstract
Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property. Most previous works adopt either a bottom-up, two-stage approach or a point-based, one-stage approach, which often suffer from high time complexity or suboptimal designs. In this work, we propose a novel SGG method that addresses these issues by formulating the task as a bipartite graph construction problem. Specifically, we create a transformer-based end-to-end framework that generates the entity and entity-aware predicate proposal sets and infers directed edges to form relation triplets. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Building on the bipartite graph assembling paradigm, we further propose a new technical design to improve the efficacy of entity-aware modeling and the optimization stability of graph assembling. Equipped with the enhanced entity-aware design, our method achieves strong performance at low time complexity. Extensive experimental results show that our design achieves state-of-the-art or comparable performance on three challenging benchmarks, surpassing most existing approaches while offering higher inference efficiency. Code is available at https://github.com/Scarecrow0/SGTR
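To make the bipartite formulation concrete, the following is a minimal sketch of the assembling step, not the implementation in the repository above. It assumes each predicate node carries a subject-side and an object-side feature vector (a stand-in for the entity-aware structure) and links it to the entity node with the highest cosine similarity. The function names, the dictionary layout, the cosine score, and the hard argmax are all illustrative assumptions; the module described in the abstract is learned and trained end to end.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def assemble_triplets(entities, predicates):
    """Link each predicate node to its best-matching subject and object entity.

    entities:   list of {"label": str, "feat": list[float]}   (hypothetical layout)
    predicates: list of {"label": str, "sub_feat": [...], "obj_feat": [...]}
    Returns a list of (subject_label, predicate_label, object_label, score).
    """
    triplets = []
    for p in predicates:
        # Score every entity as a candidate subject and as a candidate object.
        sub_scores = [cosine(p["sub_feat"], e["feat"]) for e in entities]
        obj_scores = [cosine(p["obj_feat"], e["feat"]) for e in entities]
        # Hard argmax is an assumption made for readability.
        s = max(range(len(entities)), key=sub_scores.__getitem__)
        o = max(range(len(entities)), key=obj_scores.__getitem__)
        score = sub_scores[s] * obj_scores[o]
        triplets.append((entities[s]["label"], p["label"], entities[o]["label"], score))
    return triplets


# Toy entity and predicate nodes with made-up 3-d features.
entities = [
    {"label": "person", "feat": [1.0, 0.0, 0.0]},
    {"label": "horse", "feat": [0.0, 1.0, 0.0]},
    {"label": "hat", "feat": [0.0, 0.0, 1.0]},
]
predicates = [
    {"label": "riding", "sub_feat": [0.9, 0.1, 0.0], "obj_feat": [0.1, 0.9, 0.0]},
    {"label": "wearing", "sub_feat": [0.8, 0.0, 0.2], "obj_feat": [0.0, 0.1, 0.9]},
]

triplets = assemble_triplets(entities, predicates)
for sub, pred, obj, score in triplets:
    print(sub, pred, obj, round(score, 3))
```

In this toy setup each predicate node produces two directed edges, one from its subject entity and one to its object entity, which together form one relation triplet. A trainable variant would replace the argmax with a soft or top-K selection so that gradients can flow through the matching.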
