DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation

Zeeshan Hayder, Xuming He

TL;DR

DSGG is a new Transformer-based method that treats scene graph detection as a direct graph prediction problem over a unique set of graph-aware queries, and uses relation distillation to efficiently learn multiple instances of semantic relationships.

Abstract

Scene graph generation aims to capture detailed spatial and semantic relationships between objects in an image, which is challenging due to incomplete labelling, long-tailed relationship categories, and relational semantic overlap. Existing Transformer-based methods either employ distinct queries for objects and predicates or use holistic queries for relation triplets, and hence often have limited capacity for learning low-frequency relationships. In this paper, we present a new Transformer-based method, called DSGG, that views scene graph detection as a direct graph prediction problem based on a unique set of graph-aware queries. In particular, each graph-aware query encodes a compact representation of both a node and all of its relations in the graph, learned through relaxed sub-graph matching during training. Moreover, to address the problem of relational semantic overlap, we use a relation distillation strategy that aims to efficiently learn multiple instances of semantic relationships. Extensive experiments on the VG and PSG datasets show that our model achieves state-of-the-art results, with a significant improvement of 3.5% and 6.7% in mR@50 and mR@100 on the scene graph generation task and an even more substantial improvement of 8.5% and 10.3% in mR@50 and mR@100 on the panoptic scene graph generation task. Code is available at https://github.com/zeeshanhayder/DSGG.
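
The abstract reports its gains in mean recall (mR@K), which averages Recall@K over predicate classes so that rare predicates weigh as much as frequent ones. The following is a minimal sketch of that difference, not the paper's evaluation code: it assumes a single image and exact triplet matching, and the function name `recall_metrics` and the toy triplets are hypothetical. A full evaluation would also check box or mask overlap and aggregate over a whole dataset.

```python
from collections import defaultdict


def recall_metrics(ranked_triplets, gt_triplets, k):
    """Return (recall@k, mean recall@k) for one image.

    Simplifying assumption: a prediction is a hit only on an exact
    (subject, predicate, object) match; localization is ignored.
    """
    top_k = set(ranked_triplets[:k])
    hits = defaultdict(int)    # hits per predicate class
    totals = defaultdict(int)  # ground-truth count per predicate class
    for triplet in gt_triplets:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1
    if not totals:
        return 0.0, 0.0
    # Recall@k pools all ground-truth triplets together.
    recall = sum(hits.values()) / sum(totals.values())
    # mR@k computes recall per predicate class, then averages the classes.
    mean_recall = sum(hits[p] / totals[p] for p in totals) / len(totals)
    return recall, mean_recall


# Toy ground truth: "on" is frequent, "walking" is rare.
gt = [
    ("man", "on", "street"),
    ("dog", "on", "street"),
    ("car", "on", "street"),
    ("hat", "on", "man"),
    ("man", "walking", "dog"),
]
# Toy predictions, best first: every "on" is found, "walking" is missed.
ranked = [
    ("man", "on", "street"),
    ("hat", "on", "man"),
    ("dog", "on", "street"),
    ("car", "on", "street"),
    ("man", "near", "dog"),
]
recall, mean_recall = recall_metrics(ranked, gt, k=5)
print(recall, mean_recall)  # 0.8 0.5
```

In this toy case, missing the single rare predicate costs only one fifth of Recall@5 but half of mR@5, which is why mean recall is the more informative metric for long-tailed relationship categories.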

Paper Structure

This paper contains 38 sections, 8 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Illustration of the different queries used in SGG networks. a) Multi-query transformer networks learn entities and predicates separately. b) Triplet query-based transformer networks use a separate query for each triplet. c) Our proposed graph-aware queries jointly learn a compact representation of each object and all of its relations.
  • Figure 2: An illustration of the DSGG architecture. The proposed method adopts a single-stage transformer architecture that employs graph-aware queries to predict the scene graph. The input image is first processed by the backbone network and then passed through the transformer to extract the compositional tokens. These tokens are used to predict the class confidence, bounding box, and segmentation of each object. Additionally, a dense relation embedding module learns the pairwise relations between all pairs of objects in the image. A prediction graph is then generated and compared against the ground-truth graph to find the optimal permutation of nodes. To rank the final relations, dense relation distillation and re-scoring modules are used.
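
The matching step in the Figure 2 caption, in which the prediction graph is compared against the ground-truth graph to find a permutation of nodes, can be sketched on a toy example. This is an illustration only and not the paper's relaxed sub-graph matching: the cost function, the brute-force search, and all names (`match_cost`, `best_assignment`, the toy probabilities) are assumptions made here. The sketch scores an assignment by the predicted probability of each ground-truth class and each ground-truth predicate, reading the latter from a dense query-by-query relation table.

```python
from itertools import permutations


def match_cost(pred_nodes, pred_rel, gt_nodes, gt_rel, perm):
    """Cost of assigning ground-truth node i to query perm[i] (lower is better)."""
    cost = 0.0
    for i, gt_class in enumerate(gt_nodes):
        # Node term: reward a high predicted probability for the true class.
        cost -= pred_nodes[perm[i]][gt_class]
    for (subj, obj), predicate in gt_rel.items():
        # Edge term: look up the dense relation table at the matched pair.
        cost -= pred_rel[perm[subj]][perm[obj]][predicate]
    return cost


def best_assignment(pred_nodes, pred_rel, gt_nodes, gt_rel):
    """Exhaustively search all injective node-to-query assignments.

    Assumption: brute force is only viable for tiny graphs; a practical
    system would need a polynomial-time solver or a relaxation.
    """
    candidates = permutations(range(len(pred_nodes)), len(gt_nodes))
    return min(
        candidates,
        key=lambda perm: match_cost(pred_nodes, pred_rel, gt_nodes, gt_rel, perm),
    )


# Toy setup: classes 0=person, 1=horse; predicates 0=riding, 1=near.
n_queries, n_predicates = 3, 2
pred_nodes = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]  # class probabilities per query
pred_rel = [
    [[0.05] * n_predicates for _ in range(n_queries)] for _ in range(n_queries)
]
pred_rel[1][0] = [0.7, 0.2]  # query 1 -> query 0 is most likely "riding"

gt_nodes = [0, 1]        # ground truth: a person and a horse
gt_rel = {(0, 1): 0}     # person (node 0) riding horse (node 1)

assignment = best_assignment(pred_nodes, pred_rel, gt_nodes, gt_rel)
print(assignment)  # (1, 0): person -> query 1, horse -> query 0
```

Because the edge term is part of the cost, a query is matched on its relations as well as its class, which is the intuition behind a query that represents a node together with all of its relations.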