From Data to Modeling: Fully Open-vocabulary Scene Graph Generation

Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen

TL;DR

This work introduces OvSGTR, a DETR-like transformer framework that enables fully open-vocabulary scene graph generation by predicting unseen objects and relationships. It jointly leverages frozen image and text encoders, a lightweight relation head, and a visual-concept alignment mechanism that replaces fixed-vocabulary classifiers. The authors explore three relation-aware pre-training pipelines and add a visual-concept retention mechanism with knowledge distillation to mitigate catastrophic forgetting during fine-tuning. Extensive experiments on VG150 and GQA demonstrate state-of-the-art performance across closed-set and fully open-vocabulary settings, highlighting the value of large-scale relation-aware pre-training for generalized visual reasoning and graph-like scene understanding.

Abstract

We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
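
The idea of replacing a fixed-vocabulary classifier with visual-concept alignment can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature value, and the random vectors standing in for text-encoder outputs are all assumptions made for illustration. The sketch scores each visual feature against the embedding of each concept name by cosine similarity, so the label set is defined by whatever concept embeddings are supplied rather than by a fixed output layer.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def align_to_concepts(visual_feats, concept_embeds, temperature=0.07):
    """Score visual features against concept (text) embeddings.

    visual_feats:   (num_queries, dim) features for nodes or edges
    concept_embeds: (num_concepts, dim) embeddings of concept names
    temperature:    hypothetical scaling factor, not taken from the paper
    Returns a (num_queries, num_concepts) matrix of softmax probabilities.
    """
    v = l2_normalize(visual_feats)
    t = l2_normalize(concept_embeds)
    logits = v @ t.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


# Random vectors stand in for the output of a frozen text encoder (assumption).
rng = np.random.default_rng(0)
concepts = ["person", "horse", "riding"]
concept_embeds = rng.normal(size=(len(concepts), 16))

# Two visual features lying close to the "horse" and "riding" embeddings.
visual_feats = concept_embeds[[1, 2]] + 0.01 * rng.normal(size=(2, 16))

probs = align_to_concepts(visual_feats, concept_embeds)
predicted = [concepts[i] for i in probs.argmax(axis=1)]
print(predicted)
```

Under this formulation, supporting a new object or relation category amounts to appending one more row to `concept_embeds`; no classifier weights need to be resized or retrained.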

Paper Structure

This paper contains 22 sections, 9 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Closed-set SGG.
  • Figure 2: OvD-SGG.
  • Figure 3: OvR-SGG.
  • Figure 4: OvD+R-SGG.
  • Figure 6: Comparison of different pipelines for relation-aware pre-training. (a) Early weakly-supervised methods (zhong2021learning; li2022integrating) use a language scene parser (mao2018parser) to extract relationship triplets; (b) LLM-based methods replace the scene parser with a more powerful LLM to synthesize a denser scene graph, e.g., GPT4SGG (chen2023gpt4sgg); (c) the multimodal LLM (MLLM)-based pipeline directly digests an input image and outputs a dense scene graph, e.g., MegaSG (chen2024makes).
  • ...and 4 more figures