Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation

Rohith Peddi, Saurabh, Ayush Abhay Shrivastava, Parag Singla, Vibhav Gogate

TL;DR

This work tackles the long-tail bias and distribution-shift challenges in Spatio-Temporal Scene Graph (STSG) generation and anticipation by introducing ImparTail, a loss-masking, curriculum-guided training framework that emphasizes tail predicate learning without altering model architecture. By replacing the full predicate loss with a curriculum-guided masked loss and progressively balancing the predicate distribution, ImparTail mitigates head-class dominance while maintaining head-class performance. The authors also propose two tasks, Robust Spatio-Temporal Scene Graph Generation and Robust Scene Graph Anticipation, to benchmark resilience under real-world corruptions. Empirical results on Action Genome show significant improvements in mean recall for VidSGG and SGA, along with enhanced robustness under diverse input corruptions, demonstrating practical gains for unbiased and reliable STSG systems.

Abstract

Spatio-Temporal Scene Graphs (STSGs) provide a concise and expressive representation of dynamic scenes by modeling objects and their evolving relationships over time. However, real-world visual relationships often exhibit a long-tailed distribution, causing existing methods for tasks like Video Scene Graph Generation (VidSGG) and Scene Graph Anticipation (SGA) to produce biased scene graphs. To this end, we propose ImparTail, a novel training framework that leverages loss masking and curriculum learning to mitigate bias in the generation and anticipation of spatio-temporal scene graphs. Unlike prior methods that add extra architectural components to learn unbiased estimators, we propose an impartial training objective that reduces the dominance of head classes during learning and focuses on underrepresented tail relationships. Our curriculum-driven mask generation strategy further empowers the model to adaptively adjust its bias mitigation strategy over time, enabling more balanced and robust estimations. To thoroughly assess performance under various distribution shifts, we also introduce two new tasks, Robust Spatio-Temporal Scene Graph Generation and Robust Scene Graph Anticipation, offering a challenging benchmark for evaluating the resilience of STSG models. Extensive experiments on the Action Genome dataset demonstrate the superior unbiased performance and robustness of our method compared to existing baselines.
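
As a rough illustration of the loss-masking idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a multi-label predicate classifier trained with binary cross-entropy, and it assumes a linear curriculum in which positive labels of frequent classes are dropped with increasing probability until the expected class counts become uniform. The function names (`keep_probabilities`, `sample_mask`, `masked_bce`), the linear schedule, and the choice to mask only positive labels are all hypothetical.

```python
import math
import random


def keep_probabilities(class_counts, progress):
    """Per-class probability of keeping a positive label.

    progress = 0.0 keeps every label (original long-tailed distribution);
    progress = 1.0 keeps class c with probability min_count / count_c, so the
    expected number of kept labels is the same for every class.
    Assumes every entry of class_counts is > 0. The linear schedule is an
    assumption made for this sketch.
    """
    floor = min(class_counts)
    return [(1.0 - progress) + progress * (floor / count) for count in class_counts]


def sample_mask(labels, keep_probs, rng):
    """Return a 0/1 mask; 0 means the entry is excluded from the loss.

    Only positive labels can be masked here (an assumption of this sketch).
    """
    mask = []
    for y, p in zip(labels, keep_probs):
        dropped = (y == 1) and (rng.random() >= p)
        mask.append(0 if dropped else 1)
    return mask


def masked_bce(logits, labels, mask):
    """Mean binary cross-entropy over unmasked entries (stable logit form)."""
    total, kept = 0.0, 0
    for z, y, m in zip(logits, labels, mask):
        if m == 0:
            continue  # masked label contributes nothing to the loss
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        kept += 1
    return total / kept if kept else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    counts = [1000, 100, 10]   # hypothetical head, body, tail frequencies
    labels = [1, 1, 1]
    logits = [2.0, 0.5, -1.0]
    for progress in (0.0, 0.5, 1.0):
        probs = keep_probabilities(counts, progress)
        mask = sample_mask(labels, probs, rng)
        print(progress, mask, masked_bce(logits, labels, mask))
```

In this sketch the model and its architecture are untouched; only the set of labels that contribute to the loss changes as `progress` moves from 0 to 1, which mirrors the abstract's claim that no extra architectural components are needed.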

Paper Structure

This paper contains 83 sections, 45 equations, 10 figures, 34 tables, 3 algorithms.

Figures (10)

  • Figure 1: Overview. Row-1: Existing training pipelines in the literature for VidSGG/SGA tasks. Row-2: Prior unbiased learning work, exemplified by the supplementary architectural modules and loss functions. Row-3 (ImparTail): A framework through which any prior object-centric, representation learning–based VidSGG/SGA method can be adapted to learn a corresponding unbiased estimator.
  • Figure 2: (a) Long-Tailed Distribution. Predicates in Spatio-Temporal Scene Graph (STSG) datasets exhibit a long-tailed distribution; one such example is the Action Genome dataset [Ji_2019], whose distribution is described at the top left. (b) Tasks. We focus on two STSG tasks, including Video Scene Graph Generation (VidSGG) on the left and Scene Graph Anticipation (SGA) on the right. VidSGG entails the identification of fine-grained relationships between the objects observed in the video, such as (Person, looking_at, Paper Notebook) and (Person, not_looking_at, Paper Notebook) in respective frames to the left. SGA aims to anticipate the evolution of these relationships to (Person, touching, Cup), and eventually, (Person, drinking_from, Cup) [peddi_et_al_scene_sayer_2024]. (c) Conventional Learning. Due to the inherent long-tailed distribution of these datasets, models learnt using the conventional approaches focus more on the head classes and perform poorly on the tail classes, as illustrated using the prediction scores of STTran [cong_et_al_sttran_2021] on contacting and attention relationships (see middle row). (d) Unbiased Learning. To alleviate the dominance of head classes during training, in unbiased learning, we focus more on the tail classes, ensuring that the learnt models exhibit significantly better performance in predicting both head and tail classes (see bottom row).
  • Figure 3: Overview of ImparTail. (a) Pipeline. The forward pass of ImparTail begins with an ORPU, where initial object proposals are generated for each observed frame. These object representations are then fed to STPUs designed to construct spatio-temporal context-aware relationship representations of interacting objects. The pipeline is mostly the same for both tasks, VidSGG and SGA, with an additional LDPU added for SGA to anticipate relationship representations for future frames. These observed (for VidSGG) or anticipated (for SGA) relationship representations are then decoded to construct STSGs. (b) Conventional Training. Previous approaches estimated loss for all relationship predicates (head and tail classes). (c) Masked Training. Given the inherent long-tailed nature of the STSG datasets, conventional training results in biased VidSGG and SGA models. Thus, to de-bias the training and learn an unbiased model, in ImparTail, instead of estimating loss for all relationship predicates, we estimate a masked loss, where we selectively mask the labels corresponding to dominant head classes and void their contribution to learning. (d) Curriculum-Guided Mask Generation. In ImparTail, we introduce a curriculum-based approach for masking relationship predicate labels during training. At each iteration, we adjust the selection of masked predicates to balance the class distribution progressively. As illustrated, initially, the model trains on the original, long-tailed distribution. As training advances, we systematically mask predicate labels from the head classes, gradually shifting the distribution toward uniformity.
  • Figure 4: (a) Robustness Evaluation Pipeline: We present a methodology to assess the robustness of trained VidSGG and SGA models when faced with input distribution shifts. Specifically, we systematically introduce corruptions to the frames of test videos, which are then used as inputs for the trained models. (b) Corrupted Frames: We illustrate the frames obtained by inducing various categories of corruptions (see Appendix Sec.4 for details).
  • Figure 5: Predicate Classification recall performance of models trained using existing VidSGG methods (STTran, DSGDetr) and models trained using their ImparTail adaptations. In each row, we compare the R@10/50 performance of each relationship category without corruptions (left) and with corruptions (right) in the input data.
  • ...and 5 more figures
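
The robustness evaluation described in the Figure 4(a) caption (corrupt the test frames, then feed them to an already trained model) can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the paper's benchmark code: frames are modeled as nested lists of grayscale values in [0, 1], Gaussian noise stands in for the paper's corruption categories, and the names `gaussian_noise`, `corrupt_video`, `robustness_sweep`, and the severity-to-sigma scale of 0.05 are hypothetical.

```python
import random


def gaussian_noise(frame, severity, rng):
    """Add zero-mean Gaussian noise to one frame and clip to [0, 1].

    The mapping sigma = 0.05 * severity is an assumption for this sketch.
    """
    sigma = 0.05 * severity
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in frame
    ]


def corrupt_video(frames, corruption, severity, seed=0):
    """Apply one corruption to every frame of a test video (inputs untouched)."""
    rng = random.Random(seed)
    return [corruption(frame, severity, rng) for frame in frames]


def robustness_sweep(model, frames, corruption, severities, score):
    """Score a trained model on clean (severity 0) and corrupted inputs.

    `model` and `score` are caller-supplied callables; the model is only
    evaluated, never retrained, on the corrupted frames.
    """
    results = {}
    for severity in severities:
        inputs = frames if severity == 0 else corrupt_video(frames, corruption, severity)
        results[severity] = score(model(inputs))
    return results


if __name__ == "__main__":
    video = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]  # toy 3-frame video

    def mean_brightness(frames):
        # Stand-in for a trained VidSGG/SGA model; returns a single number.
        pixels = [px for f in frames for row in f for px in row]
        return sum(pixels) / len(pixels)

    print(robustness_sweep(mean_brightness, video, gaussian_noise, [0, 1, 3], lambda out: out))
```

Keeping the corruption as a pluggable callable means further corruption categories can be swapped in without changing the evaluation loop.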