A Fair Ranking and New Model for Panoptic Scene Graph Generation

Julian Lorenz, Alexander Pest, Daniel Kienzle, Katja Ludwig, Rainer Lienhart

TL;DR

The paper identifies critical flaws in the PSGG evaluation protocol (MultiMPO) that permit duplicate masks and multiple predicate distributions to inflate scores. It advocates a fair SingleMPO protocol and re-evaluates existing PSGG methods, revealing that two-stage approaches outperform one-stage methods when measured fairly. It introduces DSFormer, a decoupled two-stage model that encodes subject and object masks directly into transformer features using specialized tokens and losses, achieving state-of-the-art performance on $mR@50$ and $mNgR@50$ (e.g., $mR@50$ of 30.67 and $mNgR@50$ of 50.08) and demonstrating strong gains over prior PSGG models. The work underscores the crucial impact of the segmentation model quality on PSGG performance and advocates adopting SingleMPO for fair comparisons while highlighting the practical benefits of decoupled, segmentation-aware two-stage methods for scalable PSGG.

Abstract

In panoptic scene graph generation (PSGG), models retrieve interactions between objects in an image that are grounded by panoptic segmentation masks. Previous evaluations on panoptic scene graphs have been subject to an erroneous evaluation protocol in which multiple masks for the same object can lead to multiple relation distributions per mask-mask pair. This can be exploited to increase the final score. We correct this flaw and provide a fair ranking over a wide range of existing PSGG models. The observed scores for existing methods increase by up to 7.4 mR@50 for all two-stage methods, while dropping by up to 19.3 mR@50 for all one-stage methods, highlighting the importance of a correct evaluation. Contrary to recent publications, we show that existing two-stage methods are competitive with one-stage methods. Building on this, we introduce the Decoupled SceneFormer (DSFormer), a novel two-stage model that outperforms all existing scene graph models by a large margin of +11 mR@50 and +10 mNgR@50 on the corrected evaluation, thus setting a new SOTA. As a core design principle, DSFormer encodes subject and object masks directly into feature space.
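
The flaw described above can be illustrated with a small sketch. It is not the authors' evaluation code: the names `Triplet`, `recall_multi`, and `recall_single` are hypothetical, plain recall stands in for the class-averaged mR@k, and masks are assumed to be already matched to ground-truth objects. The "single" variant assumes that duplicates are resolved by keeping only the highest-scoring relation per matched subject-object pair.

```python
from collections import namedtuple

# Hypothetical container for one predicted relation between two predicted masks.
Triplet = namedtuple("Triplet", "subj_mask predicate obj_mask score")


def recall_multi(triplets, mask_to_gt, gt_triplets, k):
    """Recall@k when every mask-mask pair may contribute its own prediction."""
    top = sorted(triplets, key=lambda t: t.score, reverse=True)[:k]
    hits = {(mask_to_gt[t.subj_mask], t.predicate, mask_to_gt[t.obj_mask]) for t in top}
    return len(hits & gt_triplets) / len(gt_triplets)


def recall_single(triplets, mask_to_gt, gt_triplets, k):
    """Recall@k with one prediction per ground-truth subject-object pair.

    Assumption: duplicates are resolved by keeping the highest-scoring triplet.
    """
    best = {}
    for t in triplets:
        pair = (mask_to_gt[t.subj_mask], mask_to_gt[t.obj_mask])
        if pair not in best or t.score > best[pair].score:
            best[pair] = t
    return recall_multi(list(best.values()), mask_to_gt, gt_triplets, k)


# Toy data: three masks for "person", two for "chair", one for "bottle".
mask_to_gt = {"p1": "person", "p2": "person", "p3": "person",
              "c1": "chair", "c2": "chair", "b1": "bottle"}
gt_triplets = {("person", "drinking", "bottle"), ("person", "on", "chair")}
predictions = [
    Triplet("p1", "eating", "b1", 0.9),    # confident but wrong
    Triplet("p2", "drinking", "b1", 0.4),  # correct, via a duplicate mask
    Triplet("p1", "driving", "c1", 0.8),   # confident but wrong
    Triplet("p3", "on", "c2", 0.3),        # correct, via duplicate masks
]

multi = recall_multi(predictions, mask_to_gt, gt_triplets, k=50)
single = recall_single(predictions, mask_to_gt, gt_triplets, k=50)
print(multi, single)
```

In this toy case the duplicate masks let both low-confidence correct relations count, so the first function reports full recall, while the second reports none because the most confident relation for each pair is wrong.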

Paper Structure (30 sections, 5 equations, 13 figures, 6 tables, 3 algorithms)

Figures (13)

  • Figure 1: Schematic comparison of the output from existing one-stage methods (e.g. HiLo, Fig. B) to our proposed two-stage method (Fig. C). One-stage methods often output multiple masks per real world object, visualized with colored masks in Fig. B. This results in one predicate score distribution per mask-mask pair but multiple distributions for pairs that share the same ground truth subject and object. In current evaluation implementations, multiple masks or relations are not aggregated and can therefore be exploited to increase mR@k scores. Our new method does not have this flaw.
  • Figure 2: Schematic comparison of the two considered evaluation protocols. (A) The ground truth has a single mask per subject/object. (B) There are three different masks for "person" and two for "chair". If all of them are kept, every ground truth relation is covered and MultiMPO computes a recall of 100%, even though the hypothetical model in this example is much more confident in returning person-eating-bottle instead of person-drinking-bottle and person-driving-chair instead of person-on-chair. (C) Enforcing a single mask per subject/object and a single predicate distribution per subject-object pair reveals the error in predicting the most probable relation.
  • Figure 3: Our proposed architecture for DSFormer. In a forward pass, the model requires an image, subject and object class, and segmentation masks for subject and object. During training, ground truth data is used. During evaluation, segmentation masks and class labels are inferred from a capable segmentation model. DSFormer outputs a relation prediction as well as an auxiliary subject and object class prediction which are only used during training. Figure 4 shows how the different tokens that enter the transformer module are derived.
  • Figure 4: Most tokens for our proposed model are derived from the segmentation masks. In a patch token, the overlap ratios of the subject and object masks are encoded by adding a weighted sum over learnable subject, object, and background tokens to the initial feature patch. The location token is inferred from the normalized bounding boxes of subject and object using a two-layer MLP. The semantic token is derived directly from subject and object class via a learnable embedding that returns a unique vector for each unique subject-object class combination.
  • Figure 5: Comparison of achieved mR@50 scores with: (1) originally published unfair MultiMPO, (2) our newly introduced fair SingleMPO, and (3) a modification of two-stage methods that uses a better mask model and exploits MultiMPO similar to some one-stage methods. Even though all methods are evaluated equally, mR@50 scores for all one-stage methods decline with a maximum decrease of 19.3 for SingleMPO.
  • ...and 8 more figures
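
The patch-token construction described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the DSFormer implementation: `patch_ratio` and `patch_token` are hypothetical names, the weights are assumed to be the fractions of patch pixels covered by the subject mask, the object mask, and neither mask, and the "learnable" tokens are fixed toy vectors. The location and semantic tokens are not covered.

```python
def patch_ratio(mask, top, left, size):
    """Fraction of pixels in a size x size patch covered by a binary mask."""
    covered = sum(mask[r][c]
                  for r in range(top, top + size)
                  for c in range(left, left + size))
    return covered / (size * size)


def patch_token(feature, subj_mask, obj_mask, top, left, size,
                subj_tok, obj_tok, bg_tok):
    """Add a mask-weighted sum of subject/object/background tokens to a feature patch."""
    w_s = patch_ratio(subj_mask, top, left, size)
    w_o = patch_ratio(obj_mask, top, left, size)
    # Assumption: the background weight is the share of pixels in neither mask.
    neither = sum(1
                  for r in range(top, top + size)
                  for c in range(left, left + size)
                  if not (subj_mask[r][c] or obj_mask[r][c]))
    w_b = neither / (size * size)
    return [f + w_s * s + w_o * o + w_b * b
            for f, s, o, b in zip(feature, subj_tok, obj_tok, bg_tok)]


# Toy 4x4 masks; the top-left 2x2 patch is 3/4 subject and 1/4 background.
subj_mask = [[1, 1, 0, 0],
             [1, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
obj_mask = [[0, 0, 1, 1],
            [0, 0, 1, 1],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]

# Fixed stand-ins for the learnable tokens and for one feature patch.
subj_tok, obj_tok, bg_tok = [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]
feature = [0.5, 0.5]

tok = patch_token(feature, subj_mask, obj_mask, 0, 0, 2,
                  subj_tok, obj_tok, bg_tok)
print(tok)
```

In a trained model the three token vectors would be parameters updated by backpropagation; they are constants here only to keep the arithmetic easy to follow.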