Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing
Yanjun Li, Zhaoyang Li, Honghui Chen, Lizhi Xu
TL;DR
This paper tackles bias in video scene graph generation (VidSGG) by separating visual bias from semantic bias and mitigating each explicitly. It introduces VISA, a dual debiasing framework that pairs memory-guided temporal integration (MGSM) for visual debiasing with an Iterative Relation Generator (IRG) and a Hierarchical Semantics Extractor (HSE) for semantic debiasing, so that dynamic relationships are extracted more robustly across frames. On Action Genome, the method reports state-of-the-art unbiased VidSGG results, including a +13.1% improvement in mR@20 and mR@50 on the SGCLS task under Semi Constraint, and the results indicate that combining visual and semantic cues through memory and context is important. The approach is relevant to reliable scene understanding in video and may extend visual-semantic debiasing to broader multimodal perception tasks.
Abstract
Video Scene Graph Generation (VidSGG) aims to capture dynamic relationships among entities by sequentially analyzing video frames and integrating visual and semantic information. However, VidSGG is challenged by significant biases that skew predictions. To mitigate these biases, we propose a VIsual and Semantic Awareness (VISA) framework for unbiased VidSGG. VISA addresses visual bias through memory-enhanced temporal integration that strengthens object representations. Concurrently, it reduces semantic bias by iteratively integrating object features with comprehensive semantic information derived from triplet relationships. This visual-semantic dual debiasing approach yields less biased representations of complex scene dynamics. Extensive experiments demonstrate the effectiveness of our method: VISA outperforms existing unbiased VidSGG approaches by a substantial margin (e.g., +13.1% improvement in mR@20 and mR@50 for the SGCLS task under Semi Constraint).
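
The mR@K figures quoted above refer to mean recall at K, which averages recall over predicate classes so that rare relationships count as much as frequent ones. The following is a minimal sketch of that idea, not the paper's evaluation code. It assumes triplets are plain (subject, predicate, object) tuples matched by exact equality, ignores bounding-box matching and the constraint settings (such as Semi Constraint), and uses hypothetical function names chosen for illustration.

```python
from collections import defaultdict


def recall_at_k(frames, k):
    """Plain R@K: per-frame fraction of ground-truth triplets found in the
    top-k predictions, averaged over frames that have ground truth."""
    scores = []
    for gt_triplets, ranked_preds in frames:
        gt = set(gt_triplets)
        if not gt:
            continue  # frames without ground truth contribute nothing
        top_k = set(ranked_preds[:k])
        scores.append(len(gt & top_k) / len(gt))
    return sum(scores) / len(scores) if scores else 0.0


def mean_recall_at_k(frames, k):
    """mR@K: compute recall separately for each predicate class
    (averaged over the frames where that predicate occurs),
    then average over predicate classes."""
    per_predicate = defaultdict(list)
    for gt_triplets, ranked_preds in frames:
        top_k = set(ranked_preds[:k])
        # Group this frame's ground-truth triplets by predicate label.
        gt_by_predicate = defaultdict(list)
        for triplet in set(gt_triplets):
            gt_by_predicate[triplet[1]].append(triplet)
        for predicate, triplets in gt_by_predicate.items():
            hits = sum(1 for t in triplets if t in top_k)
            per_predicate[predicate].append(hits / len(triplets))
    if not per_predicate:
        return 0.0
    class_recalls = [sum(v) / len(v) for v in per_predicate.values()]
    return sum(class_recalls) / len(class_recalls)


# Toy data (invented for illustration): each frame is a pair of
# (ground-truth triplets, predictions ranked by confidence).
toy_frames = [
    (
        [("p1", "holding", "cup"), ("p1", "looking_at", "cup"), ("p1", "wiping", "table")],
        [("p1", "holding", "cup"), ("p1", "looking_at", "cup"), ("p1", "sitting_on", "chair")],
    ),
    (
        [("p1", "holding", "phone"), ("p1", "looking_at", "phone")],
        [("p1", "holding", "phone"), ("p1", "looking_at", "phone")],
    ),
]

print(recall_at_k(toy_frames, 2))
print(mean_recall_at_k(toy_frames, 2))
```

In this toy data the model always finds the frequent predicates (holding, looking_at) but misses the single rare one (wiping). Plain recall is barely affected, while mean recall drops noticeably, which is why mR@K is the metric typically reported for unbiased scene graph generation.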
