Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen

TL;DR

This work proposes a unified framework named OvSGTR, an end-to-end transformer architecture, which learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories in relation-involved open vocabulary SGG.

Abstract

Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, which restricts them to recognizing only predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on the node and edge: Closed-set SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open vocabulary SGG has been studied recently, the more challenging problem of relation-involved open vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture, which learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pretraining utilizing image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework. Our code is available at https://github.com/gpt4vision/OvSGTR/.
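
The four settings named in the abstract can be read as a two-by-two grid: does the test vocabulary contain unseen object (node) categories, and does it contain unseen relation (edge) categories? The sketch below illustrates that reading only. The function name `sgg_setting` and its parameters are hypothetical and are not taken from the OvSGTR code base.

```python
from typing import Iterable


def sgg_setting(
    seen_objects: Iterable[str],
    seen_relations: Iterable[str],
    test_objects: Iterable[str],
    test_relations: Iterable[str],
) -> str:
    """Name the SGG setting implied by the train/test vocabularies.

    Hypothetical helper: a category is treated as "unseen" when it
    appears at test time but not in the training vocabulary.
    """
    novel_nodes = bool(set(test_objects) - set(seen_objects))
    novel_edges = bool(set(test_relations) - set(seen_relations))

    if novel_nodes and novel_edges:
        return "OvD+R-SGG"   # unseen objects and unseen relations
    if novel_nodes:
        return "OvD-SGG"     # unseen objects only
    if novel_edges:
        return "OvR-SGG"     # unseen relations only
    return "Closed-set SGG"  # everything was seen during training


# Toy vocabularies, invented for illustration.
setting = sgg_setting(
    seen_objects={"person", "horse"},
    seen_relations={"on"},
    test_objects={"person", "horse"},
    test_relations={"on", "riding"},
)
print(setting)
```

Here only the relation "riding" is new at test time, so the toy example falls under OvR-SGG.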

Paper Structure (12 sections, 3 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Illustration of SGG Scenarios (best viewed in color). Dashed nodes or edges in (a) - (d) refer to unseen category instances, and stars refer to the difficulty of each setting. Previous works [xu2017scene, zellers2018neural, tang2019learning, tang2020unbiased, chiou2021recovering, li2021bipartite, chen2019knowledge, zhang2019graphical] mainly focus on Closed-set SGG, and few studies [he2022towards, zhang2023learning] cover OvD-SGG. In this work, we give a more comprehensive study towards fully open vocabulary SGG.
  • Figure 2: Overview of our proposed OvSGTR. The proposed OvSGTR is equipped with a frozen image backbone to extract visual features, a frozen text encoder to extract text features, and a transformer for decoding scene graphs. Visual features for nodes are the output hidden features of the transformer; visual features for edges are obtained via a lightweight relation head (i.e., only a two-layer MLP). Visual-concept alignment associates visual features of nodes/edges with corresponding text features. Visual-concept retention aims to transfer the teacher's capability of recognizing unseen categories to the student.
  • Figure 3: Ablation study of relation queries on VG150 validation set (Closed-set SGG).
  • Figure 4: Qualitative results of our model on VG150 test set (best viewed in color). For clarity, we only show triplets with high confidence in top-20 predictions. Dashed nodes or arrows refer to novel object categories or novel relationships.
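
The Figure 2 caption describes two mechanisms: visual-concept alignment, which associates node/edge visual features with text features, and visual-concept retention, which transfers the teacher's capability to the student. The following is a minimal sketch of both ideas in plain Python, under loud assumptions: cosine similarity as the alignment score and a mean absolute difference as the retention penalty are illustrative choices, not confirmed details of OvSGTR. The names `align_to_concept` and `retention_penalty` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_to_concept(
    visual_feature: Sequence[float],
    concept_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Pick the concept whose text embedding best matches the visual feature.

    Assumption: alignment is scored by cosine similarity. Because the
    candidates are text embeddings, the vocabulary can be extended with
    unseen category names without changing this function.
    """
    return max(
        concept_embeddings,
        key=lambda name: cosine(visual_feature, concept_embeddings[name]),
    )


def retention_penalty(
    student_feature: Sequence[float],
    teacher_feature: Sequence[float],
) -> float:
    """Mean absolute difference between student and teacher features.

    Assumption: a simple distillation-style penalty that keeps the
    student's features close to those of the teacher.
    """
    diffs = [abs(s - t) for s, t in zip(student_feature, teacher_feature)]
    return sum(diffs) / len(diffs)


# Toy 2-D embeddings, invented for illustration.
relation_concepts = {"on": [1.0, 0.0], "riding": [0.0, 1.0]}
edge_feature = [0.2, 0.9]

predicted = align_to_concept(edge_feature, relation_concepts)
penalty = retention_penalty([1.0, 2.0], [1.0, 4.0])
print(predicted, penalty)
```

In the toy example the edge feature lies closest to the "riding" embedding, and the penalty is the average of the per-dimension gaps between the student and teacher vectors.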