Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, B. S. Manjunath

TL;DR

Click2Graph addresses the lack of user guidance in Panoptic Video Scene Graph Generation by enabling interactive, single-prompt-driven construction of temporally consistent scene graphs. It integrates a Dynamic Interaction Discovery Module (DIDM) that produces subject-conditioned object prompts with a Semantic Classification Head (SCH) that jointly infers subjects, objects, and predicates, producing structured triplets $\langle s_i, o_{i,j}, r_{i,j} \rangle$ and corresponding masks over time. Built on the SAM2 segmentation backbone, Click2Graph is evaluated on OpenPVSG, demonstrating controllable and interpretable video understanding that combines visual prompting, panoptic grounding, and relational inference. While showing strong spatial grounding and object discovery, the work also highlights semantic classification as the primary bottleneck due to large, fine-grained label spaces, pointing to future work in language integration and interactive supervision to improve predicate reasoning and long-tail learning.

Abstract

State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.
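
To make the notion of a temporally consistent scene graph concrete, the following is a minimal sketch of one way per-frame triplet predictions could be collapsed into contiguous time spans. The names `Triplet`, `RelationSpan`, and `merge_frame_predictions` are hypothetical and do not come from the paper; masks are omitted for brevity, and the labels are borrowed from the Figure 1 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    # Hypothetical container for one <subject, object, predicate> prediction.
    subject: str
    predicate: str
    obj: str


@dataclass
class RelationSpan:
    # A triplet together with the inclusive frame range over which it holds.
    triplet: Triplet
    start_frame: int
    end_frame: int


def merge_frame_predictions(per_frame):
    """Collapse per-frame triplet lists into contiguous spans.

    A triplet that disappears and later reappears starts a new span.
    """
    spans = []
    open_spans = {}  # triplet -> span that is still being extended
    for frame_idx, triplets in enumerate(per_frame):
        current = list(dict.fromkeys(triplets))  # de-duplicate, keep order
        for t in current:
            if t in open_spans:
                open_spans[t].end_frame = frame_idx
            else:
                span = RelationSpan(t, frame_idx, frame_idx)
                open_spans[t] = span
                spans.append(span)
        # Close spans whose triplet was not predicted in this frame.
        for t in list(open_spans):
            if t not in current:
                del open_spans[t]
    return spans


# Toy predictions for five frames (illustrative labels only).
dog_sit = Triplet("dog", "sitting", "carpet")
child_play = Triplet("child", "playing", "dog")
per_frame = [[dog_sit], [dog_sit], [dog_sit, child_play], [child_play], [dog_sit]]

spans = merge_frame_predictions(per_frame)
for s in spans:
    print(s.triplet.subject, s.triplet.predicate, s.triplet.obj,
          s.start_frame, s.end_frame)
```

Storing spans rather than per-frame labels is only one possible design; it keeps the graph compact and makes gaps in a relation explicit.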

Paper Structure

This paper contains 32 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: In the left example, the user clicks on the $\langle \text{dog} \rangle$; Click2Graph segments the $\langle \text{carpet} \rangle$ and predicts the $\langle \text{sitting} \rangle$ activity. In the right example, a prompt on the $\langle \text{child} \rangle$ yields $\langle \text{dog} \rangle$ and $\langle \text{playing} \rangle$ as the associated object and activity.
  • Figure 2: Overview of the Click2Graph architecture for user-guided Panoptic Video Scene Graph Generation. From a single user prompt, the system segments and tracks the subject, discovers interacting objects via the Dynamic Interaction Discovery Module (DIDM), and predicts subject–object–predicate triplets using the Semantic Classification Head (SCH).
  • Figure 3: Architecture of the Dynamic Interaction Discovery Module (DIDM). A single user-provided subject prompt is transformed into $N_q$ predicted object prompts. The module combines a feature vector derived from the subject mask with learnable object queries. These tokens pass through a Transformer decoder that cross-attends over the image features, enabling the module to autonomously predict, via the Point Prediction Head, the locations of all entities interacting with the prompted subject.
  • Figure 4: Qualitative results illustrating correct predictions, occlusion robustness, and typical failure cases.
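
The data flow described in the Figure 3 caption can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it assumes masked average pooling for the subject feature, adds that feature to each object query as the conditioning step, uses a single round of single-head cross-attention in place of a full Transformer decoder, and replaces learned weights with random values. All function and variable names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_average(features, mask):
    # Pool image tokens inside the subject mask (assumed pooling choice).
    picked = [f for f, inside in zip(features, mask) if inside]
    if not picked:
        raise ValueError("subject mask is empty")
    return [sum(col) / len(picked) for col in zip(*picked)]


def cross_attend(query, features):
    # Scaled dot-product attention of one query over all image tokens.
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale
              for feat in features]
    weights = softmax(scores)
    return [sum(w * v for w, v in zip(weights, col)) for col in zip(*features)]


def discover_object_prompts(features, subject_mask, object_queries, point_head):
    """Map one subject mask to N_q normalized (x, y) point prompts."""
    subject_vec = masked_average(features, subject_mask)
    points = []
    for query in object_queries:
        # Condition the object query on the subject (assumed: addition).
        token = [q + s for q, s in zip(query, subject_vec)]
        attended = cross_attend(token, features)
        token = [t + a for t, a in zip(token, attended)]  # residual update
        # Point head: linear layer to two logits, squashed into (0, 1).
        x_logit, y_logit = (sum(w * t for w, t in zip(row, token))
                            for row in point_head)
        points.append((sigmoid(x_logit), sigmoid(y_logit)))
    return points


# Toy inputs; random values stand in for image features and learned weights.
random.seed(0)
DIM, N_TOKENS, N_Q = 8, 16, 3
features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_TOKENS)]
subject_mask = [i < 4 for i in range(N_TOKENS)]
object_queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_Q)]
point_head = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(2)]

points = discover_object_prompts(features, subject_mask, object_queries, point_head)
print(points)
```

In a setup like this, each predicted point would be passed back to the promptable segmenter as an object prompt, so the number of discovered objects is bounded by $N_q$.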