Situational Scene Graph for Structured Human-centric Situation Understanding
Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
TL;DR
Situational Scene Graphs (SSGs) unify human–object relations with semantic role–value (SRV) frames to capture multi-action contexts in videos. The paper introduces InComNet, a four-stage transformer-based network that uses CLIP embeddings and a translucent visual prompt to iteratively predict verb predicates and SRVs for persons, objects, and relations, together with a new SSG dataset built on Action Genome/Charades. Empirical results show that InComNet outperforms baselines on SSG generation, improves downstream situation recognition and predicate classification, and enhances reasoning in human-centric scenarios, with notable gains from iterative refinement and CLIP fine-tuning. The work also demonstrates SSG’s potential for broader tasks such as video QA, dense captioning, and video generation, while acknowledging annotation costs and suggesting semi-supervised approaches to scale.
Abstract
Graph-based representations have been widely used for modelling spatio-temporal relationships in video understanding. Although effective, existing graph-based approaches focus on capturing human-object relationships while ignoring fine-grained semantic properties of the action components. These semantic properties are crucial for understanding the current situation, such as where the action takes place, what tools are used, and the functional properties of the objects. In this work, we propose a graph-based representation called the Situational Scene Graph (SSG) to encode both human-object relationships and the corresponding semantic properties. The semantic details are represented as predefined roles and values inspired by the situation frame, which was originally designed to represent a single action. Based on our proposed representation, we introduce the task of situational scene graph generation and propose a multi-stage pipeline, the Interactive and Complementary Network (InComNet), to address the task. Given that existing datasets are not applicable to the task, we further introduce an SSG dataset whose annotations consist of semantic role-value frames for humans, objects, and the verb predicates of human-object relations. Finally, we demonstrate the effectiveness of our proposed SSG representation by testing it on different downstream tasks. Experimental results show that the unified representation benefits not only predicate classification and semantic role-value classification, but also reasoning tasks on human-centric situation understanding. We will release the code and the dataset soon.
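
To make the representation described in the abstract more concrete, the following is a minimal sketch of how a situational scene graph could be held in memory: a person node, object nodes, and human-object relations labelled with verb predicates, each carrying its own semantic role-value frame. All class names, field names, role names, and example values here are assumptions made for illustration; they are not taken from the paper's code or dataset schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A semantic role-value frame maps a role name to a value.
# The role names used below are illustrative, not the dataset's vocabulary.
Frame = Dict[str, str]


@dataclass
class Node:
    """A person or object node with its own role-value frame."""
    name: str
    frame: Frame = field(default_factory=dict)


@dataclass
class Relation:
    """A human-object edge labelled with a verb predicate and that verb's frame."""
    subject: str
    verb: str
    obj: str
    frame: Frame = field(default_factory=dict)


@dataclass
class SituationalSceneGraph:
    """Hypothetical container: one person, several objects, several relations."""
    person: Node
    objects: List[Node] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def verbs_for(self, object_name: str) -> List[str]:
        """Return every verb predicate linking the person to the given object."""
        return [r.verb for r in self.relations if r.obj == object_name]


def build_example() -> SituationalSceneGraph:
    """Build a toy graph with two simultaneous actions on the same object."""
    person = Node("person", {"posture": "sitting"})
    cup = Node("cup", {"function": "container"})
    relations = [
        Relation("person", "holding", "cup", {"place": "kitchen", "tool": "hand"}),
        Relation("person", "drinking_from", "cup", {"place": "kitchen"}),
    ]
    return SituationalSceneGraph(person, [cup], relations)


if __name__ == "__main__":
    graph = build_example()
    print(graph.verbs_for("cup"))
```

Keeping a separate frame on each node and each relation mirrors the abstract's description of role-value frames for humans, objects, and verb predicates, and lets several verbs attach to the same human-object pair, which a single-action situation frame cannot express.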
