DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime

Julian Lorenz, Vladyslav Kovganko, Elias Kohout, Mrunmai Phatak, Daniel Kienzle, Rainer Lienhart

TL;DR

DSFlash is a low-latency model for panoptic scene graph generation that targets speed and resource efficiency. Unlike prior approaches, which often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, providing richer contextual information while keeping latency low.

Abstract

Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks such as reasoning for embodied agents. However, practical deployment in real-world applications, especially on resource-constrained edge devices, requires speed and resource efficiency, and these challenges have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU without sacrificing performance relative to existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while retaining its low latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. This accessibility makes DSFlash particularly well suited to researchers and practitioners operating with limited computational resources, enabling them to adapt and fine-tune SGG models for specialized applications.

Paper Structure (29 sections, 3 equations, 14 figures, 5 tables, 2 algorithms)

Figures (14)

  • Figure 1: Comparison between our approach and previous work in terms of performance (mR@50) and latency (ms) on the PSG dataset [psg].
  • Figure 2: Overview of the DSFlash architecture for inference. Part A is executed once per image. Part B is executed for each combination of two segmentation masks. We use EoMT [eomt] as the segmentation backbone, which is kept frozen throughout training. We use the mask embedding module from DSFormer [dsformer]. The relation predictor head is described in Section \ref{sec:bidir}. The red numbers indicate which section covers each component.
  • Figure 3: Qualitative comparison of the segmentation masks produced by EoMT with and without upsampling the logits.
  • Figure 4: Schematic of DSFlash's gating mechanism and the enforced feature consistency loss during training. Given two segmentation masks and an image, DSFlash computes a class token $x$ using various modules, summarized here as $\Phi$. To train the consistency loss, DSFlash performs two forward passes through the model head with flipped segmentation masks.
  • Figure 5: Impact of the segmentation model's capability on the final performance. mR@inf [sgbench] is the best possible mR@k that a hypothetical perfect PSGG model could achieve, given the segmentation masks extracted by the segmentation model.
  • ...and 9 more figures
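
The Figure 2 caption describes a two-part inference structure: Part A runs once per image, and Part B runs once per combination of two segmentation masks. The following is a minimal sketch of that control flow only, not the authors' implementation. The function names `encode_image_once`, `mask_pairs`, and `count_head_calls` are hypothetical, and the choice between unordered and ordered pairs is an assumption, since the caption does not say which one is used.

```python
from itertools import combinations, permutations


def encode_image_once(image):
    """Hypothetical stand-in for Part A (per-image work, e.g. the frozen backbone)."""
    # Placeholder: a real model would compute image features and masks here.
    return {"features": image}


def mask_pairs(num_masks, ordered=False):
    """Enumerate index pairs of distinct masks.

    ordered=False yields N*(N-1)/2 unordered pairs; ordered=True yields
    N*(N-1) (subject, object) pairs. Which variant applies is an assumption.
    """
    make_pairs = permutations if ordered else combinations
    return list(make_pairs(range(num_masks), 2))


def count_head_calls(image, num_masks, ordered=False):
    """Run Part A once, then count how often Part B would be invoked."""
    shared = encode_image_once(image)  # Part A: executed once per image
    calls = 0
    for first, second in mask_pairs(num_masks, ordered):
        # Part B placeholder: a real head would consume `shared` and both masks.
        calls += 1
    return calls


print(count_head_calls(image=None, num_masks=4))                # 6
print(count_head_calls(image=None, num_masks=4, ordered=True))  # 12
```

Because the per-pair work grows quadratically with the number of masks, keeping Part B cheap matters much more for latency than the one-off cost of Part A.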