RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation

Hae-Won Jo, Yeong-Jun Cho

TL;DR

RS-Net tackles two core challenges in dynamic scene graph generation: the lack of supervision for non-annotated object pairs and the limited context available from short temporal windows. It introduces a modular framework with a spatial context encoder, a temporal context encoder, and a relation scoring decoder that together evaluate the contextual relevance of each object pair across an entire video. By integrating a video-level temporal context token into the relation representations and multiplying RS-Net’s context score by the conventional triplet score, RS-Net consistently improves Recall, Precision, and mean Recall across diverse DSGG baselines on Action Genome while maintaining competitive efficiency. The approach offers a practical, generalizable way to strengthen relational reasoning in dynamic scenes, yielding better predicate ranking and more accurate scene graphs for video understanding.
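
The score combination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the baseline triplet score is the product of subject, predicate, and object confidences (a common convention in scene graph generation), and the function name `rerank_triplets` and all numbers are hypothetical.

```python
import numpy as np


def rerank_triplets(subject_scores, predicate_scores, object_scores, relation_scores):
    """Rank candidate triplets by a context-weighted score, highest first.

    Assumption: the baseline triplet score is the product of the subject,
    predicate, and object confidences. The relation score (one value per
    object pair, in [0, 1]) is multiplied in as an extra factor.
    """
    base = subject_scores * predicate_scores * object_scores
    final = base * relation_scores
    # Negate so that argsort yields descending order; stable sort keeps ties in input order.
    order = np.argsort(-final, kind="stable")
    return order, final


# Three hypothetical candidate triplets in one frame.
subject_scores = np.array([0.9, 0.9, 0.9])
predicate_scores = np.array([0.8, 0.6, 0.7])
object_scores = np.array([0.9, 0.8, 0.5])
# Hypothetical relation scores: the first pair is judged contextually irrelevant.
relation_scores = np.array([0.1, 0.9, 0.5])

order, final = rerank_triplets(
    subject_scores, predicate_scores, object_scores, relation_scores
)
print(order.tolist())  # [1, 2, 0]
```

Without the relation score, the first triplet would rank highest (baseline scores 0.648, 0.432, 0.315). Multiplying by the relation score pushes it to the bottom, which is the kind of reordering illustrated in Figure 1.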

Abstract

Dynamic Scene Graph Generation (DSGG) models how object relations evolve over time in videos. However, existing methods are trained only on annotated object pairs and lack guidance for non-related pairs, making it difficult to identify meaningful relations during inference. In this paper, we propose Relation Scoring Network (RS-Net), a modular framework that scores the contextual importance of object pairs using both spatial interactions and long-range temporal context. RS-Net consists of a spatial context encoder with learnable context tokens and a temporal encoder that aggregates video-level information. The resulting relation scores are integrated into a unified triplet scoring mechanism to enhance relation prediction. RS-Net can be easily integrated into existing DSGG models without architectural changes. Experiments on the Action Genome dataset show that RS-Net consistently improves both Recall and Precision across diverse baselines, with notable gains in mean Recall, highlighting its ability to address the long-tailed distribution of relations. Despite the increased number of parameters, RS-Net maintains competitive efficiency, achieving superior performance over state-of-the-art methods.
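
To see why gains in mean Recall indicate better handling of a long-tailed predicate distribution, the sketch below contrasts Recall@K with mean Recall@K on a single toy frame. It is a simplified illustration under stated assumptions: benchmark evaluation aggregates over many frames and follows the dataset's own protocol, and the function names, triplets, and predicate labels here are hypothetical.

```python
from collections import defaultdict


def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)


def mean_recall_at_k(ranked_predictions, ground_truth, k):
    """Recall@k computed per predicate class, then averaged over classes."""
    top_k = set(ranked_predictions[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]  # triplets are (subject, predicate, object)
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1
    if not totals:
        return 0.0
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


# Hypothetical ground truth: one frequent predicate (3 instances), one rare (1 instance).
ground_truth = [
    ("person", "holding", "cup"),
    ("person", "holding", "phone"),
    ("person", "holding", "book"),
    ("person", "wiping", "table"),
]
# Hypothetical ranked predictions that recover only the frequent predicate.
ranked_predictions = [
    ("person", "holding", "cup"),
    ("person", "holding", "phone"),
    ("person", "holding", "book"),
    ("person", "holding", "table"),
]

r_at_3 = recall_at_k(ranked_predictions, ground_truth, k=3)
mr_at_3 = mean_recall_at_k(ranked_predictions, ground_truth, k=3)
print(r_at_3, mr_at_3)  # 0.75 0.5
```

Recall@K rewards getting the frequent predicate right, while mean Recall@K weights each predicate class equally, so missing the single rare relation halves the score.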

Paper Structure

This paper contains 19 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Example of our proposed relation scoring method. (a) Initial triplet predictions from existing DSGG methods. (b) Triplets reordered by our proposed relation scoring approach, where contextually important relations are ranked higher based on video-level semantic relevance.
  • Figure 2: Overview of the proposed RS-Net. The framework consists of four main components: (a) object detection and relation representation construction, (b) spatial context encoder, (c) temporal context encoder, and (d) relation scoring decoder. RS-Net is trained to distinguish semantically meaningful relations from irrelevant ones by incorporating both spatial and temporal cues. The resulting scores are used to guide predicate classification and triplet score computation during scene graph generation.
  • Figure 3: Integration of RS-Net into existing DSGG frameworks for relation scoring. Thanks to its modular design, RS-Net can be incorporated into various DSGG frameworks without requiring major structural modifications.
  • Figure 4: Qualitative comparisons between STTran and STTran with our RS-Net. The 3rd and 4th columns present the highest-confidence predictions based on R@10 evaluation results in the SGDET setting. Orange and gray indicate detected objects that are and are not involved in ground-truth relations, respectively. Green and blue indicate correctly and incorrectly predicted predicates, respectively. The 5th column shows the relation scores between the person and objects predicted by RS-Net.
  • Figure 5: Visualization of attention scores for the frame-level context token in the Spatial Context Encoder. (a) Ground-truth subject and object instances, (b) detection results from our method, and (c) attention values from the frame-level context token to relation representations. Orange indicates object instances involved in ground-truth relations.