TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation
Xin Lin, Chong Shi, Yibing Zhan, Zuopeng Yang, Yaqi Wu, Dacheng Tao
TL;DR
Dynamic scene graph generation (SGG) in videos is hampered by contextual noise from occluded or blurred objects and by label bias from imbalanced predicate distributions. TD^2-Net addresses both with a Denoising Spatio-Temporal Transformer (D-Trans), which uses a differentiable Top-K object selector based on Gumbel-Softmax sampling to pick a robust contextual neighborhood for each object, and an Asymmetrical Reweighting Loss (AR-Loss), which decouples the focusing applied to positive and negative samples and uses the effective number of samples to counter head-tail imbalance. On Action Genome, the method achieves state-of-the-art results, including a 12.7% gain in mean Recall@10 for predicate classification (PREDCLS) under the with-constraint setting, without sacrificing recall. These advances offer a practical path toward more robust and less biased video SGG models for real-world video understanding.
Abstract
Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from two issues: 1) contextual noise, as some frames contain occluded or blurred objects; and 2) label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones, compounded by a long-tailed distribution of relationships. To address these problems, we introduce TD$^2$-Net, a network that aims at denoising and debiasing for dynamic SGG. First, we propose a denoising spatio-temporal transformer module that enhances object representations with robust contextual information. It does so through a differentiable Top-K object selector that uses the Gumbel-Softmax sampling strategy to select the relevant neighborhood for each object. Second, we introduce an asymmetrical reweighting loss to relieve label bias. This loss combines asymmetric focusing factors with the volume of samples to adjust the weight assigned to each sample. Systematic experiments on the Action Genome dataset demonstrate the superiority of TD$^2$-Net over existing state-of-the-art approaches. In particular, TD$^2$-Net outperforms the second-best competitor by 12.7\% on mean Recall@10 for predicate classification.
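To make the selection step concrete, the following is a minimal sketch of Gumbel-perturbed Top-K neighbor selection for a single query object. It is not the authors' implementation: the function names, the relevance scores, and the temperature `tau` are all hypothetical, and only the forward pass is shown. In an autograd framework, one common way to keep such a step differentiable is a straight-through estimator that uses the hard mask in the forward pass and the gradient of the soft weights in the backward pass; whether TD$^2$-Net does exactly this is an assumption.

```python
import math
import random


def gumbel_noise(rng):
    # Sample from Gumbel(0, 1) by inverse transform: -log(-log(U)).
    # U is clamped away from 0 and 1 to keep both logarithms finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def gumbel_topk_select(scores, k, tau=1.0, rng=None):
    """Select k neighbors from relevance scores (hypothetical helper).

    Returns (soft, hard): `soft` is the Gumbel-Softmax relaxation over all
    candidates, `hard` is a 0/1 mask marking the k largest perturbed scores.
    """
    rng = rng or random.Random()
    # Perturb each score with Gumbel noise, then scale by the temperature.
    perturbed = [(s + gumbel_noise(rng)) / tau for s in scores]
    soft = softmax(perturbed)
    top = set(sorted(range(len(scores)),
                     key=lambda i: perturbed[i], reverse=True)[:k])
    hard = [1.0 if i in top else 0.0 for i in range(len(scores))]
    return soft, hard


# Hypothetical relevance scores of five candidate neighbors for one object.
scores = [2.0, -1.0, 0.5, 3.0, -2.0]
soft, hard = gumbel_topk_select(scores, k=2, tau=0.5, rng=random.Random(0))
```

The hard mask would then restrict which neighbors contribute context to the object, so occluded or blurred candidates with low scores are usually left out. The noise makes the selection stochastic, which is why no particular pair of indices is guaranteed for a given set of scores.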

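The reweighting idea can be sketched in the same hedged spirit. The example below combines separate focusing exponents for positive and negative samples with a class weight derived from the standard effective-number formula $E_n = (1 - \beta^n) / (1 - \beta)$. The exact form of AR-Loss is not given in the abstract, so the function names, the default values of `gamma_pos`, `gamma_neg`, and `beta`, and the choice to apply the class weight only to positive terms are all assumptions made for illustration.

```python
import math


def effective_number_weight(n, beta=0.999):
    # Inverse of the effective number of samples E_n = (1 - beta**n) / (1 - beta).
    # Requires n >= 1; rarer classes (small n) receive larger weights.
    return (1.0 - beta) / (1.0 - beta ** n)


def asymmetric_reweighted_loss(probs, targets, class_counts,
                               gamma_pos=0.0, gamma_neg=4.0,
                               beta=0.999, eps=1e-8):
    """Illustrative loss over per-predicate probabilities (hypothetical form).

    probs:        predicted probability for each predicate class
    targets:      1 for a positive label, 0 for a negative one
    class_counts: number of training samples of each predicate class
    """
    total = 0.0
    for p, y, n in zip(probs, targets, class_counts):
        if y == 1:
            # Positive term: focusing with gamma_pos, scaled by the class weight
            # (assumption: the weight is applied to positives only).
            w = effective_number_weight(n, beta)
            total += -w * (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative term: a larger gamma_neg down-weights easy negatives.
            total += -(p ** gamma_neg) * math.log(max(1.0 - p, eps))
    return total


# Hypothetical example: one head predicate and one tail predicate, both positive.
head_w = effective_number_weight(10000)
tail_w = effective_number_weight(10)
loss = asymmetric_reweighted_loss([0.7, 0.2, 0.1], [1, 1, 0], [10000, 10, 500])
```

Decoupling the two exponents lets the many easy negative samples be suppressed strongly without also shrinking the gradient from the few positive ones, while the effective-number weight shifts emphasis toward tail predicates.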