Location-Free Scene Graph Generation
Ege Özsoy, Felix Holm, Mahdi Saleh, Tobias Czempiel, Chantal Pellegrini, Nassir Navab, Benjamin Busam
TL;DR
This work introduces location-free scene graph generation (LF-SGG), which removes the need for bounding boxes or masks in both training and evaluation. It proposes Pix2SG, an autoregressive transformer that generates a scene graph directly from an image by encoding entity classes, instances, and predicates into a token sequence, and pairs it with a heuristic tree-search graph matching algorithm for objective evaluation. The approach is validated on PSG, Visual Genome, and 4D-OR, and is further evaluated on two downstream tasks, image retrieval and zero-shot VQA, illustrating practical utility without location cues. The paper also provides a dedicated evaluation framework and ablations, showing that LF-SGG can approach, and in some domains surpass, location-based methods while drastically reducing annotation overhead.
Abstract
Scene Graph Generation (SGG) is a visual understanding task that aims to describe a scene as a graph of entities and their relationships with each other. Existing works rely on location labels in the form of bounding boxes or segmentation masks, which increases annotation costs and limits dataset expansion. Recognizing that many applications do not require location data, we break this dependency and introduce location-free scene graph generation (LF-SGG). This new task aims at predicting instances of entities, as well as their relationships, without explicitly computing their spatial localization. To objectively evaluate the task, the predicted and ground truth scene graphs need to be compared. We solve this NP-hard problem through an efficient branching algorithm. Additionally, we design the first LF-SGG method, Pix2SG, using autoregressive sequence modeling. We demonstrate the effectiveness of our method on three scene graph generation datasets as well as two downstream tasks, image retrieval and visual question answering, and show that our approach is competitive with existing methods while not relying on location cues.
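To illustrate why comparing a predicted and a ground truth scene graph requires matching, the following minimal sketch scores a prediction without any location labels. It is not the branching algorithm of this work: it uses an exhaustive search over instance assignments as a stand-in, which is only feasible for tiny graphs. The representation (instances as `(class, index)` tuples, relationships as `(subject, predicate, object)` triplets) and the names `triplet_recall` and `best_matching` are assumptions made for this example.

```python
from itertools import permutations

def triplet_recall(pred, gt, mapping):
    """Fraction of ground truth triplets recovered after renaming predicted instances."""
    if not gt:
        return 0.0
    renamed = {(mapping.get(s), p, mapping.get(o)) for s, p, o in pred}
    return len(renamed & set(gt)) / len(gt)

def best_matching(pred, gt):
    """Exhaustively search for the instance assignment with the highest recall.

    Illustrative stand-in only: the cost grows factorially with graph size.
    """
    pred_nodes = sorted({n for s, _, o in pred for n in (s, o)})
    gt_nodes = sorted({n for s, _, o in gt for n in (s, o)})
    # Padding with None lets a predicted instance stay unmatched.
    candidates = gt_nodes + [None] * len(pred_nodes)
    best_score, best_map = -1.0, {}
    for assignment in permutations(candidates, len(pred_nodes)):
        # Only instances of the same entity class may be matched.
        if any(g is not None and g[0] != p[0] for p, g in zip(pred_nodes, assignment)):
            continue
        mapping = dict(zip(pred_nodes, assignment))
        score = triplet_recall(pred, gt, mapping)
        if score > best_score:
            best_score, best_map = score, mapping
    return best_score, best_map

# Hypothetical example: the prediction is correct, but the two "person"
# instances are numbered in the opposite order from the ground truth.
gt = [(("person", 0), "riding", ("horse", 0)),
      (("person", 1), "standing on", ("grass", 0))]
pred = [(("person", 0), "standing on", ("grass", 0)),
        (("person", 1), "riding", ("horse", 0))]

identity = {n: n for s, _, o in pred for n in (s, o)}
naive_score = triplet_recall(pred, gt, identity)
score, mapping = best_matching(pred, gt)
```

In this example, comparing instance indices directly gives a recall of 0, while the best assignment swaps the two person instances and recovers both triplets. Since instance indices carry no meaning without location labels, some form of matching is needed before any metric can be computed, and the factorial cost of the exhaustive search motivates a more efficient procedure.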
