A Review and Efficient Implementation of Scene Graph Generation Metrics

Julian Lorenz, Robin Schön, Katja Ludwig, Rainer Lienhart

TL;DR

The paper addresses the lack of precise, formal definitions for scene graph generation metrics by providing a rigorous metric framework and accompanying pseudocode. It introduces SGBench, a lightweight Python package with minimal dependencies that implements all defined metrics, alongside a public benchmarking service for comparing panoptic scene graph generation (PSGG) methods on a central platform. Through experiments on PSGG methods, the authors demonstrate clearer, more reproducible evaluations and show how standardized metrics can illuminate method strengths, limitations, and trade-offs. This work enables reproducible benchmarking, accelerates method development, and promotes the visibility of new PSGG approaches in a centralized, accessible manner.

Abstract

Scene graph generation has emerged as a prominent research field in computer vision, witnessing significant advancements in recent years. However, despite these strides, precise and thorough definitions for the metrics used to evaluate scene graph generation models are lacking. In this paper, we address this gap in the literature by providing a review and precise definition of commonly used metrics in scene graph generation. Our comprehensive examination clarifies the underlying principles of these metrics and can serve as a reference or introduction to scene graph metrics. Furthermore, to facilitate the usage of these metrics, we introduce a standalone Python package called SGBench that efficiently implements all defined metrics, ensuring their accessibility to the research community. Additionally, we present a scene graph benchmarking web service that enables researchers to compare scene graph generation methods and increases the visibility of new methods in a central place. All of our code can be found at https://lorjul.github.io/sgbench/.

Paper Structure

This paper contains 39 sections, 4 equations, 3 figures, 3 tables, 6 algorithms.

Figures (3)

  • Figure 1: Screenshot of the benchmarking service interface.
  • Figure 2: Absolute vs. relative choice of $k$ for $R@k$ scores. $R@50$ and $R@{\times}10$ are strongly correlated, with a correlation coefficient of about 0.9998. This shows that both choices of $k$ are equally suited for evaluation. However, a relative $k$ is independent of the dataset.
  • Figure 3: Pair Recall@50 ($PR@50$) compared to Predicate Rank ($PRank$). A higher $PR@50$ and a lower $PRank$ are better. A better Predicate Rank does not necessarily result in a better Pair Recall. For example, PSGTR has a better $PRank$ than HiLo but a worse $PR@50$.
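
The difference between an absolute and a relative $k$ mentioned in the caption of Figure 2 can be illustrated with a minimal sketch. It is not the SGBench implementation. It assumes that a relative $k$ means a fixed factor times the number of ground-truth relations in an image, and it simplifies matching to exact triplet equality instead of matching predicted and ground-truth segmentation masks. The function names `recall_at_k` and `recall_at_relative_k` are hypothetical.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions.

    Assumption: predictions are already sorted by descending confidence and
    a prediction matches a ground-truth triplet only if it is exactly equal.
    """
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)


def recall_at_relative_k(ranked_predictions, ground_truth, factor):
    """Recall where k scales with the number of ground-truth triplets.

    Assumption: a relative k is factor * (number of ground-truth triplets).
    """
    k = factor * len(ground_truth)
    return recall_at_k(ranked_predictions, ground_truth, k)


# Toy image with two annotated relations and four ranked predictions.
ground_truth = [
    ("person", "riding", "horse"),
    ("horse", "on", "grass"),
]
ranked_predictions = [
    ("person", "on", "horse"),
    ("person", "riding", "horse"),
    ("grass", "under", "horse"),
    ("horse", "on", "grass"),
]

absolute_score = recall_at_k(ranked_predictions, ground_truth, k=2)
relative_score = recall_at_relative_k(ranked_predictions, ground_truth, factor=2)
print(absolute_score, relative_score)
```

With an absolute $k$, every image gets the same prediction budget regardless of how many relations are annotated. Under the assumption above, a relative $k$ grows with the annotation count, which is one way to read the caption's remark that it is independent of the dataset.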