TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation

Xin Lin, Chong Shi, Yibing Zhan, Zuopeng Yang, Yaqi Wu, Dacheng Tao

TL;DR

Dynamic scene graph generation in videos is challenged by contextual noise from occluded or blurred objects and by label bias from imbalanced predicate distributions. TD^2-Net addresses both with a Denoising Spatio-Temporal Transformer (D-Trans), which uses a differentiable Top-K object selector (via Gumbel-Softmax) to build a robust contextual neighborhood for each object, and an Asymmetrical Reweighting Loss (AR-Loss), which decouples positive and negative focusing and uses the effective number of samples to address head-tail imbalance. On Action Genome, the method achieves state-of-the-art results, including a 12.7% gain in mean Recall@10 for PREDCLS under the With Constraint setting, without sacrificing recall. These advances offer a practical path toward more robust and less biased video scene graph generation (VidSGG) models for real-world video understanding.
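
The two ingredients named for AR-Loss can be illustrated with a minimal sketch. This summary does not reproduce the paper's exact formulation, so the code assumes commonly used forms: the effective number of samples E_n = (1 - beta^n) / (1 - beta), as in class-balanced reweighting, and separate focusing exponents for positive and negative labels, as in asymmetric focal-style losses. The function names, `beta`, `gamma_pos`, `gamma_neg`, their default values, and the predicate counts are illustrative assumptions, not the authors' settings.

```python
import math

def effective_number_weights(class_counts, beta=0.999):
    """Per-class weights proportional to 1 / E_n, E_n = (1 - beta**n) / (1 - beta).

    Assumed form (class-balanced reweighting). Weights are normalized to sum
    to the number of classes so the overall loss scale stays comparable.
    """
    inv = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(class_counts) / sum(inv)
    return [w * scale for w in inv]

def asymmetric_reweighted_bce(probs, targets, class_weights,
                              gamma_pos=0.0, gamma_neg=2.0, eps=1e-8):
    """Binary cross-entropy with decoupled focusing for positives and negatives.

    probs[c]   : predicted probability that predicate c holds
    targets[c] : 1 if predicate c is annotated, else 0
    """
    loss = 0.0
    for p, y, w in zip(probs, targets, class_weights):
        if y == 1:
            # Positive label: focusing controlled by gamma_pos (0 disables it).
            loss += -w * (1.0 - p) ** gamma_pos * math.log(p + eps)
        else:
            # Negative label: a larger gamma_neg suppresses the many easy negatives.
            loss += -w * p ** gamma_neg * math.log(1.0 - p + eps)
    return loss / len(probs)

# Hypothetical predicate counts: one head, one mid-frequency, one tail class.
counts = [50000, 2000, 50]
weights = effective_number_weights(counts)
loss = asymmetric_reweighted_bce([0.9, 0.2, 0.1], [1, 0, 1], weights)
```

With these assumed counts the tail class receives the largest weight and the head class the smallest, while the negative-side exponent shrinks the contribution of confidently rejected predicates. Whether the class weight should also scale the negative term is a design choice made here for simplicity.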

Abstract

Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from two issues: 1) contextual noise, as some frames might contain occluded and blurred objects; and 2) label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones, compounded by a long-tailed distribution of relationships. To address these problems, we introduce a network named TD$^2$-Net that aims at denoising and debiasing for dynamic SGG. Specifically, we first propose a denoising spatio-temporal transformer module that enhances object representations with robust contextual information. This is achieved by designing a differentiable Top-K object selector that uses the Gumbel-Softmax sampling strategy to select the relevant neighborhood for each object. Second, we introduce an asymmetrical reweighting loss to relieve the issue of label bias. This loss function integrates asymmetric focusing factors and the volume of samples to adjust the weights assigned to individual samples. Systematic experimental results demonstrate the superiority of the proposed TD$^2$-Net over existing state-of-the-art approaches on the Action Genome dataset. In more detail, TD$^2$-Net outperforms the second-best competitor by 12.7% on mean-Recall@10 for predicate classification.
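
To make the Top-K selection idea concrete, here is a minimal forward-pass sketch. It assumes the selector perturbs per-neighbor relevance scores with Gumbel noise and picks K objects without replacement; the paper's actual design is not given in this summary. The function names, the scores, and the temperature `tau` are hypothetical. In an autodiff framework the hard picks would typically be paired with the soft weights (for example through a straight-through estimator) so that gradients can flow; that step is omitted because plain Python has no gradients.

```python
import math
import random

def gumbel_noise(rng):
    # Sample Gumbel(0, 1) by inverse transform; clamp to avoid log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_topk_select(scores, k, tau=0.5, rng=None):
    """Pick k neighbors from relevance scores (forward pass only).

    Returns (hard_indices, soft_weights): the k chosen indices and, for each
    pick, the temperature-softened distribution over remaining candidates.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    if rng is None:
        rng = random.Random()
    masked = [s + gumbel_noise(rng) for s in scores]
    hard_indices, soft_weights = [], []
    for _ in range(k):
        soft_weights.append(softmax([x / tau for x in masked]))
        idx = max(range(len(masked)), key=lambda i: masked[i])
        hard_indices.append(idx)
        masked[idx] = -math.inf  # exclude the picked object from later rounds
    return hard_indices, soft_weights

# Hypothetical relevance of five candidate neighbors to one query object;
# the low scores at indices 1 and 3 stand in for occluded or blurred detections.
scores = [2.0, -3.0, 1.5, -2.5, 0.5]
chosen, weights = gumbel_topk_select(scores, k=2, rng=random.Random(0))
```

Lower values of `tau` push each soft distribution toward a one-hot vector, which is the usual trade-off between sharp selection and smooth gradients in Gumbel-Softmax relaxations.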

Paper Structure (16 sections, 10 equations, 4 figures, 6 tables)

This paper contains 16 sections, 10 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: (a) Contextual noise. A significant proportion of objects may be occluded or affected by camera motion blur. (b) Label bias. As shown in the two-object example on the left, positive relationship labels are far fewer than negative ones, causing a negative-positive imbalance. Furthermore, as shown in the table on the right, the distribution of relationships exhibits a long-tailed trend.
  • Figure 2: The framework of TD$^2$-Net. TD$^2$-Net adopts Faster R-CNN to generate initial object proposals for each RGB frame in a video. It includes two new modules for dynamic scene graph generation: (1) a novel transformer module named D-Trans that enhances object features with robust contextual information, and (2) a new loss function named AR-Loss that takes into account both positive-negative imbalance and head-tail imbalance in relationship prediction.
  • Figure 3: Per-class performance comparison on the PREDCLS task. Results are in terms of R@10 under the With Constraint setting.
  • Figure 4: Qualitative comparisons between TD$^2$-Net and STTran (Cong et al., 2021) at R@100 in the SGCLS setting. Black indicates correctly classified objects or predicates; red indicates misclassified ones. Best viewed in color.
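
Since Figure 3 reports R@10 per predicate class and the headline result is stated in mean-Recall@10, a small sketch may help show how the two metrics differ. It assumes the standard definitions: Recall@K is the fraction of ground-truth triplets found among the top-K predictions, and mean Recall@K averages the recall computed separately for each predicate class. As a simplification, hits are pooled over all frames rather than averaged per frame, and the toy triplets are made up for illustration.

```python
from collections import defaultdict

def recall_and_mean_recall_at_k(frames, k):
    """frames: list of (ranked_predictions, ground_truth) pairs.

    Each item is a (subject, predicate, object) triplet; ranked_predictions
    is sorted by descending confidence.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for ranked, gt in frames:
        top_k = set(ranked[:k])
        for triplet in gt:
            predicate = triplet[1]
            totals[predicate] += 1
            hits[predicate] += triplet in top_k
    recall = sum(hits.values()) / sum(totals.values())
    mean_recall = sum(hits[p] / totals[p] for p in totals) / len(totals)
    return recall, mean_recall

# Toy data: the head predicate "holding" is always recovered,
# while the tail predicate "wiping" is missed.
frames = [
    ([("person", "holding", "cup"), ("person", "looking_at", "cup")],
     [("person", "holding", "cup")]),
    ([("person", "holding", "phone"), ("person", "in_front_of", "table")],
     [("person", "holding", "phone")]),
    ([("person", "holding", "towel"), ("person", "touching", "table")],
     [("person", "holding", "towel"), ("person", "wiping", "table")]),
]
recall, mean_recall = recall_and_mean_recall_at_k(frames, k=2)
# recall is 0.75 (3 of 4 triplets), mean_recall is 0.5 (1.0 for "holding", 0.0 for "wiping")
```

The gap between the two numbers is why mean recall is the more informative metric under a long-tailed predicate distribution: a model can score well on plain recall by handling only the head classes.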