DDS: Decoupled Dynamic Scene-Graph Generation Network

A S M Iftekhar, Raphael Ruschel, Satish Kumar, Suya You, B. S. Manjunath

TL;DR

DDS tackles the challenge of predicting unseen relationship triplets in dynamic scene graphs by decoupling object and relationship representations into two independent Transformer branches with distinct encoders, decoders, and query sets. The model uses a temporal decoder to embed cross-frame information and separate spatio-temporal decoders to produce discriminative embeddings that feed dedicated object and relation heads, enabling robust compositional generalization. Training relies on Hungarian matching and a multi-term loss that jointly optimizes bounding boxes and labels for subjects, objects, and relations, including relation regions used only during training. Empirically, DDS achieves state-of-the-art results on Action Genome, HICO-DET, and UnRel, with particularly large gains on unseen triplets, demonstrating improved generalization for compositional scene-graph generation in both DSG and SSG tasks.
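The Hungarian matching step mentioned above assigns each ground-truth triplet to exactly one predicted triplet so that the total matching cost is minimal. The following is a minimal sketch of that idea, not the paper's implementation. The cost form (L1 box distance minus the probability of the correct class), the dictionary field names, and the helpers `l1`, `pair_cost`, and `match` are all assumptions made for illustration, and the relation-region term is omitted. The search is brute force over permutations, which gives the same assignment as the Hungarian algorithm on tiny inputs; in practice `scipy.optimize.linear_sum_assignment` solves the same problem in polynomial time.

```python
from itertools import permutations


def l1(box_a, box_b):
    """L1 distance between two boxes given as (cx, cy, w, h) tuples."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def pair_cost(pred, gt):
    """Hypothetical matching cost: box distances minus correct-class probabilities."""
    box_cost = l1(pred["sub_box"], gt["sub_box"]) + l1(pred["obj_box"], gt["obj_box"])
    cls_cost = -pred["obj_probs"][gt["obj_label"]] - pred["rel_probs"][gt["rel_label"]]
    return box_cost + cls_cost


def match(preds, gts):
    """Return the minimum-cost one-to-one assignment as (pred_idx, gt_idx) pairs.

    Brute force over permutations; assumes len(preds) >= len(gts) and tiny inputs.
    """
    cost = [[pair_cost(p, g) for g in gts] for p in preds]
    best_total, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(gts)):
        # perm[g] is the prediction index assigned to ground truth g.
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return sorted((p, g) for g, p in enumerate(best_perm))


# Toy data (made up): two predicted triplets, one ground-truth triplet.
preds = [
    {"sub_box": (0.5, 0.5, 0.2, 0.4), "obj_box": (0.5, 0.7, 0.3, 0.2),
     "obj_probs": [0.1, 0.9], "rel_probs": [0.8, 0.2]},
    {"sub_box": (0.1, 0.1, 0.1, 0.1), "obj_box": (0.9, 0.9, 0.1, 0.1),
     "obj_probs": [0.5, 0.5], "rel_probs": [0.5, 0.5]},
]
gts = [
    {"sub_box": (0.5, 0.5, 0.2, 0.4), "obj_box": (0.5, 0.7, 0.3, 0.2),
     "obj_label": 1, "rel_label": 0},
]

matches = match(preds, gts)  # the first prediction overlaps the ground truth exactly
```

Predictions left unmatched would typically be supervised toward a "no object" class, as in other set-prediction detectors; whether DDS does exactly this is not stated in this section.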

Abstract

Scene-graph generation involves creating a structural representation of the relationships between objects in a scene by predicting subject-object-relation triplets from input data. Existing methods show poor performance in detecting triplets outside of a predefined set, primarily due to their reliance on dependent feature learning. To address this issue, we propose DDS -- a decoupled dynamic scene-graph generation network -- that consists of two independent branches that can disentangle extracted features. The key innovation of the current paper is the decoupling of the features representing the relationships from those of the objects, which enables the detection of novel object-relationship combinations. The DDS model is evaluated on three datasets and outperforms previous methods by a significant margin, especially in detecting previously unseen triplets.

Paper Structure (23 sections, 3 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Diagram of concept learning and transfer in DDS. By focusing on different spatial regions, DDS learns the concepts of relationships (ride, on) and objects (person, bicycle, bed) independently.
  • Figure 2: Overview of DDS's architecture. Given an input frame $\mathbf{I}_{t}$, features are extracted by the backbone and fed to decoupled object and relation branches, each with an encoder and spatio-temporal decoder. The decoders process queries and previous frame embeddings (red arrow) to produce learned embeddings, which are used by the object and relation heads to predict relationship triplets.
  • Figure 3: Design of the spatio-temporal decoders, using the object decoder as an example. The relation decoder uses the same architecture with its own corresponding inputs.
  • Figure 4: Qualitative results of DDS for predicting unusual relationship triplets in the UnRel dataset [peyre2017weakly]. The subject bounding box is green and the object bounding box is red.
  • Figure 5: Performance analysis of DDS over the base network. The attention maps are visualized from the last layer of the spatio-temporal decoder.
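
The captions for Figures 1 and 2 describe object and relation concepts being learned independently and then combined into triplets. A toy sketch of why that helps with unseen combinations is given below. It assumes, purely for illustration, that a triplet score is the product of an object-head probability and a relation-head probability; the paper's actual scoring rule is not given in this section, and `score_triplets`, the score values, and the `seen_in_training` set are hypothetical.

```python
from itertools import product


def score_triplets(subject, obj_scores, rel_scores):
    """Score every (subject, relation, object) combination from independent heads.

    Assumption: the triplet score is the product of the two head probabilities.
    Returns (triplet, score) pairs sorted from highest to lowest score.
    """
    triplets = {
        (subject, rel, obj): rel_p * obj_p
        for (rel, rel_p), (obj, obj_p) in product(rel_scores.items(), obj_scores.items())
    }
    return sorted(triplets.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical head outputs for one frame (values are made up).
obj_scores = {"bicycle": 0.2, "bed": 0.8}
rel_scores = {"ride": 0.7, "on": 0.3}

# Triplets assumed to appear in the training set.
seen_in_training = {("person", "ride", "bicycle"), ("person", "on", "bed")}

ranked = score_triplets("person", obj_scores, rel_scores)
best_triplet, best_score = ranked[0]
is_novel = best_triplet not in seen_in_training
```

Because the two heads are scored separately, the unusual combination ("person", "ride", "bed") can rank first even though it never occurs in `seen_in_training`; a model that classifies whole triplets jointly has no output slot for it.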