Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click
Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, B. S. Manjunath
TL;DR
Click2Graph addresses the lack of user guidance in Panoptic Video Scene Graph Generation (PVSG) by building temporally consistent scene graphs interactively from a single user prompt. It combines a Dynamic Interaction Discovery Module (DIDM), which produces subject-conditioned object prompts, with a Semantic Classification Head (SCH), which jointly infers subject, object, and predicate labels, yielding structured triplets $\langle s_i, o_{i,j}, r_{i,j} \rangle$ with corresponding masks over time. Built on the SAM2 segmentation backbone and evaluated on OpenPVSG, Click2Graph demonstrates controllable and interpretable video understanding that combines visual prompting, panoptic grounding, and relational inference. The system shows strong spatial grounding and object discovery, but semantic classification remains the primary bottleneck because of the large, fine-grained label spaces; this points to future work on language integration and interactive supervision to improve predicate reasoning and long-tail learning.
Abstract
State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.
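To make the output format concrete, the following is a minimal sketch of how per-frame <subject, object, predicate> predictions for one subject-object pair could be stored and merged into temporally consistent spans. It is an illustration only, not the authors' implementation: the class names `Triplet` and `SceneGraphTrack`, the method names, and the example labels are all hypothetical, and segmentation masks are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


# All names here are hypothetical and chosen for illustration;
# none of them come from the Click2Graph code base.
@dataclass(frozen=True)
class Triplet:
    subject: str    # subject class label
    obj: str        # object class label
    predicate: str  # relation label between subject and object


@dataclass
class SceneGraphTrack:
    """Predictions for one subject-object pair, indexed by frame."""
    subject_id: int
    object_id: int
    per_frame: Dict[int, Triplet] = field(default_factory=dict)

    def add(self, frame: int, triplet: Triplet) -> None:
        # Store (or overwrite) the prediction for a single frame.
        self.per_frame[frame] = triplet

    def segments(self) -> List[Tuple[int, int, Triplet]]:
        """Merge runs of consecutive frames that share the same triplet
        into (start_frame, end_frame, triplet) spans."""
        spans: List[Tuple[int, int, Triplet]] = []
        for frame in sorted(self.per_frame):
            triplet = self.per_frame[frame]
            if spans and spans[-1][2] == triplet and spans[-1][1] == frame - 1:
                # Extend the current span by one frame.
                spans[-1] = (spans[-1][0], frame, triplet)
            else:
                # Start a new span (label changed or frames are not adjacent).
                spans.append((frame, frame, triplet))
        return spans


# Usage: made-up labels for a five-frame clip.
track = SceneGraphTrack(subject_id=0, object_id=1)
for f in range(3):
    track.add(f, Triplet("adult", "cup", "holding"))
for f in range(3, 5):
    track.add(f, Triplet("adult", "cup", "drinking from"))
spans = track.segments()
```

Here `spans` contains two entries, one covering frames 0 to 2 and one covering frames 3 to 4. Keying predictions by frame keeps the sketch independent of how the subject and object masks are tracked.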
