BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection

Melissa Schween, Mathis Kruse, Bodo Rosenhahn

Abstract

We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage real-world semantic knowledge. A normalizing flow learns bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), enabling anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset, which contains office and dining room scenes. Our method achieves around 10% better AUROC than the current state-of-the-art model while being five times faster. Ablation studies demonstrate superior robustness and universality, particularly regarding the use of synonyms: our model maintains stable performance, while the baseline shows a 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code is available at https://github.com/mschween/BUSSARD .
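
The likelihood estimation mentioned in the abstract rests on the standard change-of-variables identity for normalizing flows. The sketch below states that identity; the symbols ($x$ for a triplet embedding, $f_\theta$ for the learned bijection, $s(x)$ for an anomaly score) are notation assumed here for illustration and are not necessarily the notation or scoring rule used in the paper.

```latex
% Assumed notation: x is a triplet embedding, f_\theta the learned bijection,
% p_Z the base distribution (a standard Gaussian in the typical case).
\begin{align}
  \log p_X(x) &= \log p_Z\bigl(f_\theta(x)\bigr)
      + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|,
      \qquad p_Z = \mathcal{N}(0, I), \\
  s(x) &= -\log p_X(x).
\end{align}
```

Under this reading, a triplet that the flow assigns a low likelihood receives a high score $s(x)$ and is a candidate anomaly.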

Paper Structure (30 sections, 11 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: Example image from the SARD dataset [lai2025scene]. An incomplete scene graph for this image consists of 'plate-on-chair', 'plate-near-clock', 'cup-on-table', and 'chair-near-table'. The anomaly to detect is 'plate-on-chair'.
  • Figure 2: The components of BUSSARD. The images are parsed using a pretrained scene graph generator. Each triplet is then encoded using a pretrained word embedding model. The embeddings of each triplet are concatenated, and their dimension is reduced using an autoencoder. Finally, a normalizing flow estimates the likelihood of each triplet, which is used to decide whether the triplet is anomalous.
  • Figure 3: The 40 most frequent triplets of the dining room scene. The labels belong to the highlighted bars, showing example triplets.
  • Figure 4: Ablation results with AUROC ($\uparrow$) and AUC-Recall@k ($\uparrow$) of BUSSARD and SARD-c for different synonym rates for the dining room scene. The synonym rate is the probability of substituting words using synonym mappings. For BUSSARD, the dots show the average over ten runs with different seeds, and the shaded area shows the corresponding standard deviation. SARD-c was run only once per rate because its calculation is deterministic.
  • Figure 5: Ablation results with the AUROC ($\uparrow$) for different latent space dimensions of the autoencoder.
  • ...and 7 more figures
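
The synonym rate in the Figure 4 caption can be illustrated with a minimal Python sketch. It assumes that substitution is applied independently to each word of a triplet; the `SYNONYMS` mapping and the function name `substitute_synonyms` are hypothetical and are not taken from the paper or its repository.

```python
import random

# Hypothetical synonym mapping, for illustration only.
SYNONYMS = {"cup": ["mug"], "table": ["desk"], "on": ["atop"]}


def substitute_synonyms(triplet, synonym_rate, rng):
    """Replace each word that has a synonym with probability `synonym_rate`.

    Assumption: words are substituted independently of one another.
    """
    result = []
    for word in triplet:
        candidates = SYNONYMS.get(word)
        if candidates and rng.random() < synonym_rate:
            result.append(rng.choice(candidates))
        else:
            result.append(word)  # no synonym known, or not selected
    return tuple(result)


rng = random.Random(0)  # fixed seed so runs are repeatable
triplets = [("cup", "on", "table"), ("plate", "on", "chair")]
perturbed = [substitute_synonyms(t, 0.5, rng) for t in triplets]
```

A rate of 0 leaves every triplet unchanged, and a rate of 1 replaces every word that has an entry in the mapping, so sweeping the rate between the two gives the kind of x-axis described in the caption.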