Situational Scene Graph for Structured Human-centric Situation Understanding

Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando

TL;DR

Situational Scene Graphs (SSG) unify human–object relations with semantic role–value frames to capture multi-action contexts in videos. The paper introduces InComNet, a four-stage transformer-based network that uses CLIP embeddings and a translucent visual prompt to iteratively predict verb predicates and semantic role-values (SRVs) for persons, objects, and relations. It also introduces a new SSG dataset built on Action Genome/Charades. Empirical results show that InComNet outperforms baselines on SSG generation, improves downstream situation recognition and predicate classification, and enhances reasoning in human-centric scenarios, with notable gains from iterative refinement and CLIP fine-tuning. The work also demonstrates SSG's potential for broader tasks such as video QA, dense captioning, and video generation, while acknowledging annotation costs and suggesting semi-supervised approaches to scale.

Abstract

Graph-based representations have been widely used to model spatio-temporal relationships in video understanding. Although effective, existing graph-based approaches focus on capturing human-object relationships while ignoring fine-grained semantic properties of the action components. These semantic properties, such as where the action takes place, what tools are used, and the functional properties of the objects, are crucial for understanding the current situation. In this work, we propose a graph-based representation called Situational Scene Graph (SSG) to encode both human-object relationships and the corresponding semantic properties. The semantic details are represented as predefined roles and values inspired by the situation frame, which was originally designed to represent a single action. Based on our proposed representation, we introduce the task of situational scene graph generation and propose a multi-stage pipeline, Interactive and Complementary Network (InComNet), to address the task. Given that existing datasets are not applicable to the task, we further introduce an SSG dataset whose annotations consist of semantic role-value frames for the human, objects, and verb predicates of human-object relations. Finally, we demonstrate the effectiveness of our proposed SSG representation by testing on different downstream tasks. Experimental results show that the unified representation benefits not only predicate classification and semantic role-value classification but also reasoning tasks on human-centric situation understanding. We will release the code and the dataset soon.
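
To make the representation concrete, the following is a minimal Python sketch of how a situational scene graph might be held in memory: a person node, object nodes, and verb-predicate relations, each carrying a semantic role-value frame. This is not the authors' data format. The class names, the role names ("posture", "function", "place", "body_part"), and their values are hypothetical and chosen only for illustration. The two relations mirror the concurrent actions "sitting on bed" and "holding shoes" from Figure 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A semantic role-value (SRV) frame maps a role name to its value.
# All role names and values below are hypothetical placeholders.
RoleValueFrame = Dict[str, str]


@dataclass
class Entity:
    """A person or object node together with its SRV frame."""
    label: str
    srv: RoleValueFrame = field(default_factory=dict)


@dataclass
class Relation:
    """A human-object relation: a verb predicate plus its own SRV frame."""
    subject: Entity
    verb: str
    obj: Entity
    srv: RoleValueFrame = field(default_factory=dict)


@dataclass
class SituationalSceneGraph:
    """One person and any number of concurrent human-object relations."""
    person: Entity
    relations: List[Relation] = field(default_factory=list)

    def objects(self) -> List[Entity]:
        # Object nodes in the order their relations were added.
        return [r.obj for r in self.relations]

    def triplets(self) -> List[Tuple[str, str, str]]:
        # Plain scene-graph view: (subject, verb predicate, object).
        return [(r.subject.label, r.verb, r.obj.label) for r in self.relations]


def build_example() -> SituationalSceneGraph:
    """Build a graph for the two concurrent actions named in Figure 1."""
    person = Entity("person", {"posture": "sitting"})    # hypothetical role
    bed = Entity("bed", {"function": "furniture"})       # hypothetical role
    shoes = Entity("shoes", {"function": "footwear"})    # hypothetical role
    graph = SituationalSceneGraph(person=person)
    graph.relations.append(
        Relation(person, "sitting on", bed, {"place": "bedroom"}))
    graph.relations.append(
        Relation(person, "holding", shoes, {"body_part": "hand"}))
    return graph


if __name__ == "__main__":
    for triplet in build_example().triplets():
        print(triplet)
```

Dropping the `srv` fields leaves an ordinary scene graph of triplets, while a single relation with its frame resembles a situation frame. The sketch is meant to show how the two views sit in one structure, under the assumptions stated above.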

Paper Structure

This paper contains 35 sections, 1 equation, 10 figures, 15 tables.

Figures (10)

  • Figure 1: This video frame depicts a human-centric situation with two concurrent actions, "sitting on bed" and "holding shoes". Structured action representations include (a) scene graph, (b) situation frame, and (c) situational scene graph (ours), which encompasses the person, objects, and verb predicates of human-object relations together with their semantic role-values, providing a detailed schema with precisely defined structures to describe the components of one or more concurrent actions.
  • Figure 2: This video frame illustrates a situation for the action 'holding a dish'. The tables, color-coded green, blue, and red, show the semantic roles and their associated values for the semantic entities: person, object, and verb predicate of the relation instance.
  • Figure 3: The pipeline of our proposed InComNet. Given a set of video frames, our model uses CLIP to extract the necessary feature embeddings from each frame and then classifies the SRVs of objects, the verb predicates, the SRVs of verb predicates, and the SRVs of the person. The resulting situational scene graph is shown on the right. InComNet stage (II) corresponds to SSG sub-task (1), and stages (I), (III), and (IV) correspond to SSG sub-task (2).
  • Figure 4: Architectures of verb predicate and SRV encoders.
  • Figure 5: Distribution of object occurrences in the SSG dataset.
  • ...and 5 more figures