DDS: Decoupled Dynamic Scene-Graph Generation Network
A S M Iftekhar, Raphael Ruschel, Satish Kumar, Suya You, B. S. Manjunath
TL;DR
DDS tackles the challenge of predicting unseen relationship triplets in dynamic scene graphs by decoupling object and relationship representations into two independent Transformer branches, each with its own encoder, decoder, and query set. A temporal decoder embeds cross-frame information, and separate spatio-temporal decoders produce discriminative embeddings that feed dedicated object and relation heads, which supports compositional generalization. Training relies on Hungarian matching and a multi-term loss that jointly optimizes bounding boxes and labels for subjects, objects, and relations; the relation regions are used only during training. Empirically, DDS achieves state-of-the-art results on Action Genome, HICO-DET, and UnRel, with particularly large gains on unseen triplets, demonstrating improved compositional generalization in both dynamic scene-graph (DSG) and static scene-graph (SSG) generation.
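To make the matching step concrete, the following is a minimal sketch of one-to-one assignment between predicted and ground-truth triplets. It is not the authors' implementation: the cost terms (L1 box distance plus a label-mismatch count), the weights `w_box` and `w_label`, the dictionary field names, and the example labels are all assumptions chosen for illustration. For brevity it finds the optimal assignment by brute force over permutations, which gives the same result as the Hungarian algorithm on small inputs; a real pipeline would typically call `scipy.optimize.linear_sum_assignment` on the cost matrix instead.

```python
from itertools import permutations


def l1(box_a, box_b):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def pair_cost(pred, gt, w_box=1.0, w_label=1.0):
    """Hypothetical matching cost: box distance plus number of wrong labels."""
    box_cost = l1(pred["sub_box"], gt["sub_box"]) + l1(pred["obj_box"], gt["obj_box"])
    label_cost = sum(pred[k] != gt[k] for k in ("sub_label", "obj_label", "rel_label"))
    return w_box * box_cost + w_label * label_cost


def match(preds, gts):
    """Return (pred_idx, gt_idx) pairs minimizing total cost.

    Assumes len(preds) >= len(gts); unmatched predictions would be
    treated as "no triplet" by the training loss.
    """
    best_cost, best = float("inf"), []
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, [(p, g) for g, p in enumerate(perm)]
    return best


# Illustrative data: two ground-truth triplets, three predictions.
person = (0.1, 0.1, 0.5, 0.9)
gts = [
    {"sub_box": person, "obj_box": (0.4, 0.5, 0.6, 0.7),
     "sub_label": "person", "obj_label": "cup", "rel_label": "holding"},
    {"sub_box": person, "obj_box": (0.6, 0.6, 0.9, 0.9),
     "sub_label": "person", "obj_label": "chair", "rel_label": "sitting_on"},
]
preds = [
    {"sub_box": person, "obj_box": (0.6, 0.6, 0.9, 0.8),
     "sub_label": "person", "obj_label": "chair", "rel_label": "sitting_on"},
    {"sub_box": (0.0, 0.0, 1.0, 1.0), "obj_box": (0.0, 0.0, 0.2, 0.2),
     "sub_label": "person", "obj_label": "table", "rel_label": "looking_at"},
    {"sub_box": person, "obj_box": (0.4, 0.5, 0.6, 0.7),
     "sub_label": "person", "obj_label": "cup", "rel_label": "holding"},
]

matches = match(preds, gts)
print(matches)
```

Here prediction 2 is paired with the first ground-truth triplet and prediction 0 with the second, while prediction 1 stays unmatched. The loss terms for boxes and labels would then be computed only over the matched pairs.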
Abstract
Scene-graph generation involves creating a structural representation of the relationships between objects in a scene by predicting subject-object-relation triplets from input data. Existing methods perform poorly at detecting triplets outside a predefined set, primarily because they rely on dependent feature learning. To address this issue, we propose DDS, a decoupled dynamic scene-graph generation network that consists of two independent branches that can disentangle extracted features. The key innovation of this paper is the decoupling of the features representing the relationships from those of the objects, which enables the detection of novel object-relationship combinations. The DDS model is evaluated on three datasets and outperforms previous methods by a significant margin, especially in detecting previously unseen triplets.
