Location-Free Scene Graph Generation

Ege Özsoy, Felix Holm, Mahdi Saleh, Tobias Czempiel, Chantal Pellegrini, Nassir Navab, Benjamin Busam

TL;DR

This work introduces location-free scene graph generation (LF-SGG), removing the need for bounding boxes or masks in both training and evaluation. It proposes Pix2SG, an autoregressive transformer that directly generates a scene graph from an image by encoding entity classes, instances, and predicates into a token sequence, coupled with a novel heuristic tree-search graph matching for objective evaluation. The approach is validated across PSG, Visual Genome, and 4D-OR, and demonstrates strong performance on downstream tasks such as image retrieval and zero-shot VQA, illustrating practical utility without location cues. The paper also provides a dedicated evaluation framework and ablations, showing that LF-SGG can approach, and in some domains surpass, location-based methods while drastically reducing annotation overhead.

Abstract

Scene Graph Generation (SGG) is a visual understanding task that aims to describe a scene as a graph of entities and their relationships with each other. Existing works rely on location labels in the form of bounding boxes or segmentation masks, which increases annotation costs and limits dataset expansion. Recognizing that many applications do not require location data, we break this dependency and introduce location-free scene graph generation (LF-SGG). This new task aims at predicting instances of entities, as well as their relationships, without explicitly calculating their spatial localization. To objectively evaluate the task, the predicted and ground truth scene graphs need to be compared. We solve this NP-hard problem through an efficient branching algorithm. Additionally, we design the first LF-SGG method, Pix2SG, using autoregressive sequence modeling. We demonstrate the effectiveness of our method on three scene graph generation datasets as well as two downstream tasks, image retrieval and visual question answering, and show that our approach is competitive with existing methods while not relying on location cues.
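Why the comparison step needs a matching can be illustrated with a small sketch. Without locations, the instance indices in a prediction are arbitrary, so predicted instances must be assigned to ground truth instances before triplets can be counted as correct. The code below uses an exhaustive search over assignments, which is only feasible for tiny graphs and is not the branching algorithm of the paper. The names `entity_class` and `best_matching`, and the `class_index` naming of instances, are assumptions made for illustration.

```python
from itertools import permutations


def entity_class(instance):
    # Assumed naming convention: "person_1" belongs to class "person".
    return instance.rsplit("_", 1)[0]


def best_matching(pred, gt):
    """Exhaustively search for the instance assignment that maximizes triplet hits.

    pred and gt are lists of (subject, predicate, object) triplets.
    Returns (number of matched triplets, mapping from predicted to GT instances).
    """
    pred_nodes = sorted({n for s, _, o in pred for n in (s, o)})
    gt_nodes = sorted({n for s, _, o in gt for n in (s, o)})
    gt_set = set(gt)
    # Pad with None so that any predicted instance may stay unmatched.
    slots = gt_nodes + [None] * len(pred_nodes)
    best_hits, best_map = -1, {}
    for perm in set(permutations(slots, len(pred_nodes))):
        mapping = dict(zip(pred_nodes, perm))
        # Only instances of the same entity class may be matched.
        if any(g is not None and entity_class(p) != entity_class(g)
               for p, g in mapping.items()):
            continue
        hits = sum((mapping[s], p, mapping[o]) in gt_set for s, p, o in pred)
        if hits > best_hits:
            best_hits, best_map = hits, mapping
    return best_hits, best_map


gt = [("person_1", "riding", "horse_1"), ("person_2", "beside", "horse_1")]
# Same graph, but the model numbered the two persons the other way round.
pred = [("person_1", "beside", "horse_1"), ("person_2", "riding", "horse_1")]

naive_hits = sum(t in set(gt) for t in pred)
hits, mapping = best_matching(pred, gt)
print(naive_hits, hits, mapping)
```

In this example a direct comparison of instance names finds no correct triplet, while the best assignment recovers both. The number of candidate assignments grows factorially with the number of instances, which is why an efficient search strategy is needed in practice.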

Paper Structure (25 sections, 2 equations, 15 figures, 8 tables, 1 algorithm)

Figures (15)

  • Figure 1: We introduce the new task of location-free scene graph generation (LF-SGG), completely removing the requirement for expensive bounding box or segmentation mask annotations for scene graph datasets. We further introduce Pix2SG, a method leveraging autoregressive language modeling for congruent scene graph predictions, and a heuristic tree search algorithm for scene graph matching necessary for evaluation.
  • Figure 2: Conversion of existing location-based scene graph annotations to location-free scene graphs with instance identification and mapping to the graph representation.
  • Figure 3: Pix2SG Architecture: An image encoder encodes the image as a feature map that is flattened and used as the input sequence to the autoregressive transformer module. The autoregressive transformer predicts the components of the scene graph, token by token, considering all its previous predictions until the output SG-sequence is completed.
  • Figure 4: Illustration of the scene graph matching problem. The ground truth scene graph and the prediction have to be correctly matched for the evaluation. A suboptimal matching can obscure the model's actual performance.
  • Figure 5: Qualitative results of Pix2SG on the Panoptic Scene Graph Dataset. Images and corresponding ground truth scene graphs are shown. Nodes and edges correctly predicted by our model are highlighted in green. The model also predicts additional triplets that are not in the ground truth but are still meaningful.
  • ...and 10 more figures
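The conversion in Figure 2 and the output sequence in Figure 3 can be sketched together as follows. This is a minimal illustration, not the authors' implementation: the annotation layout (`label`, `bbox`), the function names `to_location_free` and `to_sequence`, the `inst_k` token spelling, and the order of tokens within a triplet are all assumptions.

```python
from collections import defaultdict


def to_location_free(objects, relations):
    """Drop locations and identify each object by (class, per-class instance index).

    objects: list of dicts with a 'label' and a 'bbox' (hypothetical layout).
    relations: list of (subject_index, predicate, object_index) over `objects`.
    """
    counts = defaultdict(int)
    instances = []
    for obj in objects:
        counts[obj["label"]] += 1
        # The bounding box is intentionally not carried over.
        instances.append((obj["label"], counts[obj["label"]]))
    return [(instances[s], pred, instances[o]) for s, pred, o in relations]


def to_sequence(triplets):
    """Flatten triplets into one token sequence (assumed order per triplet:
    subject class, subject instance, object class, object instance, predicate)."""
    seq = []
    for (s_cls, s_id), pred, (o_cls, o_id) in triplets:
        seq += [s_cls, f"inst_{s_id}", o_cls, f"inst_{o_id}", pred]
    return seq


objects = [
    {"label": "person", "bbox": [10, 20, 110, 220]},
    {"label": "person", "bbox": [150, 30, 240, 230]},
    {"label": "horse", "bbox": [5, 100, 200, 300]},
]
relations = [(0, "riding", 2), (1, "beside", 2)]

triplets = to_location_free(objects, relations)
sequence = to_sequence(triplets)
print(sequence)
```

A sequence of this kind is what an autoregressive decoder would emit token by token. The instance tokens are what let two entities of the same class be told apart without any location information.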