Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

Ziqi Gao; Jieyu Zhang; Wisdom Oluchi Ikezogwo; Jae Sung Park; Tario G. You; Daniel Ogbu; Chenhao Zheng; Weikai Huang; Yinuo Yang; Winson Han; Quan Kong; Rajat Saini; Ranjay Krishna

TL;DR

This paper introduces SVG2, a large-scale panoptic video scene graph dataset that offers an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets, and uses it to train TRaSER, a video scene graph generation model.

Abstract

We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and two new modules, an object-trajectory resampler and a temporal-window resampler, to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR, and SVG2 test datasets, TRaSER improves relation detection by +15 to 20%, object prediction by +30 to 40% over the strongest open-source baselines and by +13% over GPT-5, and attribute prediction by +15%. When TRaSER's generated scene graphs are supplied to a VLM for video question answering, they deliver a +1.5 to 4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL's generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
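
To make the representation concrete, the following is a minimal sketch of how a spatio-temporal scene graph of the kind described above might be held in memory: objects with categories, attributes, and per-frame trajectories, plus relations grounded to a frame interval. The class and field names (`ObjectTrack`, `Relation`, `VideoSceneGraph`, `relations_at`) are hypothetical and are not SVG2's released schema, and bounding boxes stand in for panoptic masks to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ObjectTrack:
    """One object trajectory (hypothetical layout, not the SVG2 schema)."""
    track_id: int
    category: str
    attributes: list[str] = field(default_factory=list)
    # frame index -> (x1, y1, x2, y2); a simplification of per-frame masks
    boxes: dict[int, tuple[int, int, int, int]] = field(default_factory=dict)


@dataclass
class Relation:
    """A subject-predicate-object triple valid on [start_frame, end_frame]."""
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int  # inclusive


@dataclass
class VideoSceneGraph:
    objects: dict[int, ObjectTrack] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def add_object(self, obj: ObjectTrack) -> None:
        self.objects[obj.track_id] = obj

    def add_relation(self, rel: Relation) -> None:
        # Reject relations that point at unknown tracks or empty intervals.
        if rel.subject_id not in self.objects or rel.object_id not in self.objects:
            raise ValueError("relation refers to an unknown track id")
        if rel.start_frame > rel.end_frame:
            raise ValueError("start_frame must not exceed end_frame")
        self.relations.append(rel)

    def relations_at(self, frame: int) -> list[tuple[str, str, str]]:
        """Return (subject category, predicate, object category) triples active at a frame."""
        return [
            (self.objects[r.subject_id].category, r.predicate,
             self.objects[r.object_id].category)
            for r in self.relations
            if r.start_frame <= frame <= r.end_frame
        ]


if __name__ == "__main__":
    graph = VideoSceneGraph()
    graph.add_object(ObjectTrack(0, "person", ["standing"], {3: (10, 10, 50, 120)}))
    graph.add_object(ObjectTrack(1, "cup", ["red"], {3: (40, 60, 55, 80)}))
    graph.add_relation(Relation(0, "holding", 1, start_frame=3, end_frame=10))
    print(graph.relations_at(5))
```

Storing the interval on each relation, rather than repeating the triple for every frame, is one way to keep the graph compact; whether SVG2 uses this encoding is not stated in the abstract.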

Paper Structure (43 sections, 14 equations, 12 figures, 16 tables)

Introduction
Related work
Synthetic Visual Genome 2
Automatic pipeline
Extracted dataset
TraSeR
Trajectory-Aligned Token Arrangement
Dual Resampler Module
Training
Training datasets.
Implementation Details.
Experiment
Baselines.
Evaluation Setup.
Scene graph generation results.
...and 28 more sections

Figures (12)

Figure 1: Synthetic Visual Genome 2 (SVG2), a large-scale synthetic panoptic video scene graph dataset. SVG2 provides dense panoptic trajectories, fine-grained object categories and attributes, and temporally grounded spatio-temporal relations across over 636K videos, an order-of-magnitude increase in scale and diversity over prior datasets.
Figure 2: Overview of the SVG2 synthesis pipeline. Phase 1: panoptic trajectory generation with an online-offline object tracking mechanism that discovers new objects and preserves identity consistency. Phase 2: per-trajectory description and semantic parsing. Phase 3: GPT-5-based spatio-temporal relation inference to produce the final video scene graph.
Figure 3: TraSeR architecture. The model first performs trajectory-aligned token arrangement, grounding ViT tokens to instance trajectories to form identity-preserving token streams. It then applies an object-trajectory resampler to aggregate global semantics over each full trajectory, and a temporal-window resampler to preserve fine-grained motion and temporal cues. The resulting tokens are decoded by the language model into a structured video scene graph.
Figure 4: Distribution of object categories in SVG2. The dataset covers a diverse range of semantic classes including persons, vehicles, animals, furniture, and various everyday objects.
Figure 5: Distribution of attribute annotations in SVG2. The attributes span visual properties such as color, material, state, and other fine-grained descriptors.
...and 7 more figures
