OED: Towards One-stage End-to-End Dynamic Scene Graph Generation

Guan Wang; Zhimin Li; Qingchao Chen; Yang Liu

TL;DR

This work tackles dynamic scene graph generation (DSGG) in videos and the inefficiency of multi-stage pipelines that separately handle detection, association, and relation classification. It introduces OED, a one-stage end-to-end framework that models $P(\langle s,p,o\rangle|V)$ as a set prediction problem using pair-wise subject-object features, bypassing explicit object tracking. A Progressively Refined Module (PRM) aggregates temporal context by iteratively selecting reference pair-wise features and refining target-frame features, enabling end-to-end training without trackers. On Action Genome, OED achieves state-of-the-art performance in SGDET and strong results in PredCLS, highlighting the value of unified optimization and robust temporal aggregation for DSGG.

Abstract

Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos. Conventional approaches often employ multi-stage pipelines, which typically consist of object detection, temporal association, and multi-relation classification. However, these methods exhibit inherent limitations due to the separation of multiple stages, and independent optimization of these sub-problems may yield sub-optimal solutions. To remedy these limitations, we propose a one-stage end-to-end framework, termed OED, which streamlines the DSGG pipeline. This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph. Another challenge of DSGG is capturing temporal dependencies. To address it, we introduce a Progressively Refined Module (PRM) that aggregates temporal context without the constraints of additional trackers or handcrafted trajectories, enabling end-to-end optimization of the network. Extensive experiments conducted on the Action Genome benchmark demonstrate the effectiveness of our design. The code and models are available at https://github.com/guanw-pku/OED.
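To make the set prediction idea concrete, the following is a minimal sketch of one-to-one matching between predicted and ground-truth subject-predicate-object triplets. It is an illustration under stated assumptions, not the authors' implementation: the cost terms (predicate probability plus L1 box distance), the dictionary layout, and the helper names `l1`, `match_cost`, and `best_assignment` are hypothetical. Set-prediction detectors commonly solve this matching with the Hungarian algorithm; a brute-force search over permutations is swapped in here so the example needs only the standard library.

```python
from itertools import permutations

def l1(a, b):
    """L1 distance between two boxes given as [x1, y1, x2, y2]."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_cost(pred, gt):
    """Hypothetical pair-wise cost: predicate mismatch plus subject and object box error."""
    cls_cost = 1.0 - pred["predicate_probs"].get(gt["predicate"], 0.0)
    return cls_cost + l1(pred["sub_box"], gt["sub_box"]) + l1(pred["obj_box"], gt["obj_box"])

def best_assignment(preds, gts):
    """Brute-force one-to-one matching (assumes len(preds) >= len(gts)).

    Returns (assignment, cost), where assignment[i] is the index of the
    prediction matched to ground-truth triplet i.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gt) for p, gt in zip(perm, gts))
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best, best_cost

# Illustrative values only: each prediction carries a subject box, an object box,
# and predicate scores, so one query describes a whole subject-object pair.
person = [0.1, 0.1, 0.5, 0.9]
preds = [
    {"sub_box": person, "obj_box": [0.6, 0.5, 0.8, 0.7], "predicate_probs": {"holding": 0.8}},
    {"sub_box": person, "obj_box": [0.2, 0.7, 0.4, 0.9], "predicate_probs": {"sitting_on": 0.7}},
    {"sub_box": [0.0, 0.0, 0.1, 0.1], "obj_box": [0.9, 0.9, 1.0, 1.0], "predicate_probs": {"holding": 0.2}},
]
gts = [
    {"sub_box": person, "obj_box": [0.2, 0.7, 0.4, 0.9], "predicate": "sitting_on"},
    {"sub_box": person, "obj_box": [0.6, 0.5, 0.8, 0.7], "predicate": "holding"},
]

assignment, total_cost = best_assignment(preds, gts)
print(assignment, round(total_cost, 3))
```

Here the first ground-truth triplet is matched to prediction 1 and the second to prediction 0, while prediction 2 stays unmatched and would be treated as background during training. Because each prediction already encodes a full pair, no separate step is needed to enumerate candidate subject-object combinations.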

Paper Structure (20 sections, 9 equations, 3 figures, 5 tables)

Introduction
Related Work
Scene Graph Generation
Dynamic Scene Graph Generation
Stacking Multi-stage Pipeline
Modeling Temporal Dependence
Method
Problem Formulation
Overview
Spatial Context Aggregation
Temporal Context Aggregation
Training and Inference
Training
Inference
Experiments
...and 5 more sections

Figures (3)

Figure 1: Comparison between the existing multi-stage paradigm and the proposed one-stage end-to-end framework. (a) Multi-stage methods typically detect object instances with a separate object detector and may associate objects across frames, based on the detection results, to aggregate temporal context. Predicate classification then follows for all candidate subject-object pairs, and tracking may be lost in the process. (b) Our one-stage end-to-end method directly generates the dynamic scene graph for a given video sequence, without handling object instance detection and tracking separately. Missing spatial context and predicate temporal dependencies can be supplemented with the spatial context of reference frames.
Figure 2: OED Framework: Spatial-temporal context aggregation is conducted within a one-stage end-to-end paradigm. Visual features of the target frame and reference frames are extracted using a CNN backbone and a Transformer encoder. Subsequently, two cascaded decoders are employed to aggregate spatial context both within and between pairs. Temporal context is then aggregated in a progressively refined manner, considering pair-wise features of the target frame and reference frames.
Figure 3: Progressively refined long-range global temporal context aggregation.
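
The progressive refinement idea named in Figures 2 and 3 (select reference pair-wise features, refine the target-frame feature, repeat) can be sketched roughly as follows. This is a toy illustration, not the PRM as published: the dot-product selection criterion, the single-head attention update, the residual addition, the `keep_schedule` parameter, and all function names are assumptions made for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, refs):
    """Softmax-weighted average of reference features (single-head attention)."""
    scores = [dot(query, r) / math.sqrt(len(query)) for r in refs]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # shifted for numerical stability
    norm = sum(weights)
    return [sum(w * r[d] for w, r in zip(weights, refs)) / norm
            for d in range(len(query))]

def progressive_refine(target, refs, keep_schedule=(3, 1)):
    """Refine one target-frame pair feature against reference-frame pair features.

    At each step (hypothetical schedule), keep the k references most similar to
    the current target feature, then add the attended context as a residual.
    Returns the refined feature and the references kept at the last step.
    """
    feat, pool = list(target), list(refs)
    for k in keep_schedule:
        pool = sorted(pool, key=lambda r: dot(feat, r), reverse=True)[:k]
        context = attend(feat, pool)
        feat = [f + c for f, c in zip(feat, context)]
    return feat, pool

# Illustrative 2-D features: one target pair and five reference pairs.
target = [1.0, 0.0]
refs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.8, 0.2]]
refined, kept = progressive_refine(target, refs)
print(refined, kept)
```

In this toy run the dissimilar references are discarded in the first step and only the closest one survives the second, so the pool of temporal context narrows as the target feature is updated. No tracker or trajectory is involved: references are chosen purely by feature similarity, which is the property the sketch is meant to illustrate.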
