Weakly Supervised Video Scene Graph Generation via Natural Language Supervision

Kibum Kim; Kanghoon Yoon; Yeonjun In; Jaehyeong Jeon; Jinyoung Moon; Donghyun Kim; Chanyoung Park

TL;DR

This work addresses the high annotation burden of Video Scene Graph Generation (VidSGG) by proposing NL-VSGG, a weakly supervised framework that learns from video captions alone. It introduces two core modules: Temporality-aware Caption Segmentation (TCS), which splits each caption into temporally ordered sentences, and Action Duration Variability-aware Caption-Frame Alignment (ADV), which flexibly aligns each sentence to frames using clustering-based supervision. A scene-graph parsing and grounding step then yields pseudo-localized graphs, and a Motion-based Pseudo-Labeling (PLM) strategy adds negative action supervision when appropriate. On Action Genome, NL-VSGG outperforms naive WS-ImgSGG and PLA-based baselines, improves open-set action prediction, and gains further from external video-text data and longer videos. Overall, the approach offers a practical path to VidSGG with lower labeling cost and better temporal coherence in dynamic relationships.

Abstract

Existing Video Scene Graph Generation (VidSGG) models are trained in a fully supervised manner, which requires all frames in a video to be annotated and thereby incurs a high annotation cost compared to Image Scene Graph Generation (ImgSGG). Although the annotation cost of VidSGG can be alleviated by adopting a weakly supervised approach commonly used for ImgSGG (WS-ImgSGG) that uses image captions, two key issues hinder such a naive adoption: 1) Temporality within video captions, i.e., unlike image captions, video captions include temporal markers (e.g., before, while, then, after) that indicate time-related details, and 2) Variability in action duration, i.e., unlike human actions in image captions, human actions in video captions unfold over varying durations. To address these issues, we propose a Natural Language-based Video Scene Graph Generation (NL-VSGG) framework that uses only readily available video captions to train a VidSGG model. NL-VSGG consists of two key modules: the Temporality-aware Caption Segmentation (TCS) module and the Action Duration Variability-aware Caption-Frame Alignment (ADV) module. Specifically, TCS uses a Large Language Model (LLM) to segment each video caption into multiple sentences in temporal order, and ADV aligns each segmented sentence with the appropriate frames while accounting for the variability in action duration. On the Action Genome dataset, our approach significantly improves performance over simply applying the WS-ImgSGG pipeline to VidSGG. As a further benefit of using video captions as weak supervision, we show that the VidSGG model trained by NL-VSGG can predict a broader range of action classes, including ones not present in the training data, which makes our framework practical in real-world settings.
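
To make the two modules concrete, the following is a minimal sketch of the segment-then-align idea, not the authors' implementation. Every name in it (`segment_caption`, `group_frames`, `align`) is hypothetical. The segmentation step is a simple rule-based stand-in for the LLM that TCS actually uses, the frame grouping is a greedy stand-in for a real clustering method, and the feature vectors are toy 2-D values standing in for whatever sentence and frame features a real system would compute.

```python
import math
import re


def segment_caption(caption):
    """Rule-based stand-in for LLM segmentation (assumption, not the paper's TCS).

    Returns the caption's clauses in the order the events happen.
    """
    text = caption.strip().rstrip(".")
    # "X after Y" means Y happens first, so the two clauses are swapped.
    m = re.match(r"^(.*?)\s+after\s+(.*)$", text, flags=re.IGNORECASE)
    if m:
        return [m.group(2).strip(), m.group(1).strip()]
    # "X then Y" and "X before Y" already read in temporal order.
    parts = re.split(r",?\s+(?:and\s+)?(?:then|before)\s+", text, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_frames(frame_feats, threshold=0.9):
    """Greedily group consecutive frames that stay similar to the group's first frame.

    Groups can have different lengths, which is how this sketch models
    actions of varying duration. The threshold is an illustrative value.
    """
    if not frame_feats:
        return []
    groups = [[0]]
    for i in range(1, len(frame_feats)):
        anchor = frame_feats[groups[-1][0]]
        if cosine(anchor, frame_feats[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def align(sentence_feats, frame_feats, threshold=0.9):
    """Assign each sentence to the frame group whose centroid it matches best."""
    groups = group_frames(frame_feats, threshold)
    centroids = [
        [sum(col) / len(g) for col in zip(*(frame_feats[i] for i in g))]
        for g in groups
    ]
    alignment = []
    for s in sentence_feats:
        best = max(range(len(groups)), key=lambda k: cosine(s, centroids[k]))
        alignment.append(groups[best])
    return alignment


if __name__ == "__main__":
    sentences = segment_caption("A person picks up a cup, then drinks from it.")
    # Toy features: the first three frames look alike, the last two look alike.
    frames = [[1.0, 0.0], [0.98, 0.05], [0.95, 0.1], [0.0, 1.0], [0.1, 0.99]]
    sent_feats = [[0.9, 0.1], [0.05, 1.0]]
    for sentence, frame_ids in zip(sentences, align(sent_feats, frames)):
        print(sentence, "->", frame_ids)
```

In this toy run the first sentence is paired with three frames and the second with two, which illustrates why a fixed one-sentence-per-frame assignment would not fit actions of different lengths. Each sentence, once paired with its frames, could then be parsed into a scene graph and used as pseudo-supervision for those frames.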
