From Data to Modeling: Fully Open-vocabulary Scene Graph Generation

Zuyao Chen; Jinlin Wu; Zhen Lei; Chang Wen Chen

TL;DR

This work introduces OvSGTR, a DETR-like transformer framework for fully open-vocabulary scene graph generation, in which both objects and relationships may be unseen during training. It combines frozen image and text encoders, a lightweight relation head, and a visual-concept alignment mechanism that replaces fixed-vocabulary classifiers. The authors explore three relation-aware pre-training pipelines and apply a visual-concept retention mechanism with knowledge distillation to mitigate forgetting during fine-tuning. Extensive experiments on VG150 and GQA demonstrate state-of-the-art performance across closed-set and fully open-vocabulary settings, highlighting the value of large-scale relation-aware pre-training for generalized visual reasoning and graph-like scene understanding.

Abstract

We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines (scene parser-based, LLM-based, and multimodal LLM-based) to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
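
The idea of replacing a fixed-vocabulary classifier with visual-concept alignment, and of retaining concepts through distillation, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the OvSGTR implementation: the function names (`alignment_scores`, `predict_concept`, `distillation_l1`) are hypothetical, cosine similarity stands in for whatever alignment score the model uses, the L1 feature distance is one common form of distillation loss, and the 3-dimensional vectors are toy values in place of real image and text encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_scores(visual_feat, concept_embeds):
    """Cosine similarity between one visual feature and each concept embedding.

    `concept_embeds` maps a concept name to its (assumed) text embedding.
    Because the vocabulary is just a dict of embeddings, new concepts can be
    added at inference time without retraining a classifier layer.
    """
    v = l2_normalize(visual_feat)
    scores = {}
    for name, emb in concept_embeds.items():
        t = l2_normalize(emb)
        scores[name] = sum(a * b for a, b in zip(v, t))
    return scores


def predict_concept(visual_feat, concept_embeds):
    """Return the concept whose embedding aligns best with the visual feature."""
    scores = alignment_scores(visual_feat, concept_embeds)
    return max(scores, key=scores.get)


def distillation_l1(student_feats, teacher_feats):
    """Mean absolute difference between student and teacher feature vectors.

    Assumed form of a retention loss: the fine-tuned (student) features are
    kept close to those of the frozen pre-trained (teacher) model.
    """
    total, count = 0.0, 0
    for s, t in zip(student_feats, teacher_feats):
        total += sum(abs(a - b) for a, b in zip(s, t))
        count += len(s)
    return total / count if count else 0.0


# Toy vocabularies: "zebra" plays the role of a concept unseen in training.
base_vocab = {"person": [1.0, 0.0, 0.0], "horse": [0.0, 1.0, 0.0]}
open_vocab = dict(base_vocab, zebra=[0.0, 0.7, 0.7])
node_feat = [0.1, 0.6, 0.8]  # hypothetical visual feature for one node

closed_pred = predict_concept(node_feat, base_vocab)
open_pred = predict_concept(node_feat, open_vocab)
retention_loss = distillation_l1([[1.0, 2.0]], [[1.0, 1.0]])
```

In this toy setup the same node feature is scored against two vocabularies; extending the vocabulary only requires adding a text embedding, which is the property that makes alignment-based prediction open-vocabulary. The same scoring could in principle be applied to edge (relationship) features against predicate embeddings.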
