Weakly Supervised Video Scene Graph Generation via Natural Language Supervision

Kibum Kim, Kanghoon Yoon, Yeonjun In, Jaehyeong Jeon, Jinyoung Moon, Donghyun Kim, Chanyoung Park

TL;DR

This work addresses the high annotation burden of Video Scene Graph Generation (VidSGG) by proposing NL-VSGG, a weakly supervised framework that learns from video captions alone. It introduces two core modules: Temporality-aware Caption Segmentation (TCS), which segments a caption into temporally ordered sentences, and Action Duration Variability-aware Caption-Frame Alignment (ADV), which aligns each sentence to frames using clustering-based supervision. A scene-graph parsing and grounding step then yields pseudo-localized scene graphs, and a Motion-based Pseudo-Labeling (PLM) strategy assigns negative action classes from motion cues in unaligned frames. On Action Genome, NL-VSGG outperforms naive WS-ImgSGG and PLA-based baselines, improves open-set action prediction, and benefits from external video-text data and longer videos. Overall, the approach offers a practical path to VidSGG with reduced labeling cost and better temporal coherence in dynamic relationships.

Abstract

Existing Video Scene Graph Generation (VidSGG) studies are trained in a fully supervised manner, which requires all frames in a video to be annotated, thereby incurring high annotation cost compared to Image Scene Graph Generation (ImgSGG). Although the annotation cost of VidSGG can be alleviated by adopting a weakly supervised approach commonly used for ImgSGG (WS-ImgSGG) that uses image captions, two key factors hinder such a naive adoption: 1) Temporality within video captions, i.e., unlike image captions, video captions include temporal markers (e.g., before, while, then, after) that indicate time-related details, and 2) Variability in action duration, i.e., unlike human actions in image captions, human actions in video captions unfold over varying durations. To address these issues, we propose a Natural Language-based Video Scene Graph Generation (NL-VSGG) framework that utilizes only the readily available video captions for training a VidSGG model. NL-VSGG consists of two key modules: the Temporality-aware Caption Segmentation (TCS) module and the Action Duration Variability-aware caption-frame alignment (ADV) module. Specifically, TCS uses a Large Language Model (LLM) to segment the video captions into multiple sentences in temporal order, and ADV aligns each segmented sentence with appropriate frames considering the variability in action duration. Our approach leads to a significant enhancement in performance compared to simply applying the WS-ImgSGG pipeline to VidSGG on the Action Genome dataset. As a further benefit of utilizing the video captions as weak supervision, we show that the VidSGG model trained by NL-VSGG is able to predict a broader range of action classes that are not included in the training data, which makes our framework practical for real-world use.
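To make the caption-frame alignment idea more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that frames and segmented sentences have already been embedded into a shared vector space (the embeddings below are made-up toy values), groups frames with a small hand-written k-means, and assigns each sentence to the frame cluster whose centroid is most similar by cosine similarity. The function names (`kmeans`, `align_sentences`), the evenly spaced centroid initialization, and the choice of cosine similarity are all illustrative assumptions; this section does not specify how ADV performs its clustering or matching.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Toy k-means with deterministic init (evenly spaced points)."""
    step = len(points) / k
    centroids = [list(points[int(i * step)]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every frame.
        labels = [
            min(range(k), key=lambda c: sq_dist(pt, centroids[c]))
            for pt in points
        ]
        # Update step: recompute centroids, keeping empty clusters unchanged.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def align_sentences(frame_embs, sentence_embs, k):
    """Map each sentence index to the frame indices of its best cluster.

    Clusters may contain different numbers of frames, so sentences
    describing longer actions can receive more frames.
    """
    labels, centroids = kmeans(frame_embs, k)
    alignment = {}
    for s_idx, s_emb in enumerate(sentence_embs):
        best = max(range(k), key=lambda c: cosine(s_emb, centroids[c]))
        alignment[s_idx] = [f for f, lab in enumerate(labels) if lab == best]
    return alignment


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: three frames of one action, two of another.
    frames = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
    sentences = [[1.0, 0.1], [0.05, 1.0]]
    print(align_sentences(frames, sentences, k=2))
```

In this toy input the first sentence is matched to three frames and the second to two, which illustrates why aligning to clusters rather than to a fixed number of frames can accommodate actions of different durations.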

Paper Structure

This paper contains 37 sections, 12 figures, and 14 tables.

Figures (12)

  • Figure 1: (a) The fully supervised VidSGG requires costly localized scene graphs across all frames. (b) The pipeline of WS-ImgSGG. (c) The pipeline of WS-VidSGG needs to consider the temporality within the caption addressed by temporal segmentation and the variability of action duration addressed by temporal alignment.
  • Figure 2: Ratio of temporal markers.
  • Figure 3: The overall framework of NL-VSGG. With an input video and its caption, (a) we employ the TCS module to segment the input video caption into sentences based on temporality. (b) In the ADV module, each segmented sentence is aligned with appropriate frames considering the variability in action duration. (c) The segmented sentences are then parsed and grounded to generate pseudo-localized scene graphs. (d) Furthermore, we assign negative classes based on the motion cues within unaligned frames. (e) Utilizing the pseudo-localized scene graphs and pseudo-labeled negative classes, we then train a VidSGG model.
  • Figure 4: Example of motion cue.
  • Figure 5: Qualitative results of NL-VSGG for a broader range of action classes. The red-colored texts indicate predicates with novel meanings that are not present in the AG dataset.
  • ...and 7 more figures