Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation

Lin Li, Chuhan Zhang, Dong Zhang, Chong Sun, Chen Li, Long Chen

TL;DR

This work tackles open vocabulary scene graph generation (OVSGG) by addressing a key limitation of existing pipelines: they treat all objects equally, which leads to mismatched relation pairs among interacting objects. It introduces INOVA, an interaction-aware OVSGG framework that combines three components: bidirectional interaction prompts for pre-training target grounding, a two-step interaction-guided query selection for supervised fine-tuning, and interaction-consistent knowledge distillation to preserve both semantic and relational structure. Empirical results on Visual Genome and GQA show state-of-the-art performance in both the OvR-SGG and OvD+R-SGG settings, with substantial gains in $R@100$ for novel relations and improved robustness across base and novel categories. The approach demonstrates that explicitly modeling object interactions yields more accurate and generalizable scene graphs, enabling more reliable real-world scene understanding and downstream reasoning tasks.
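For readers unfamiliar with the $R@100$ metric mentioned above, the following is a minimal sketch of triplet Recall@K: the fraction of ground-truth triplets that appear among the top-K ranked predictions. The function name `recall_at_k` and the toy triplets are illustrative assumptions, not taken from the paper. The sketch compares labels only; standard SGG evaluation typically also requires the predicted boxes to overlap the ground-truth boxes sufficiently, which is omitted here.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def recall_at_k(ranked_predictions: List[Triplet],
                ground_truth: List[Triplet],
                k: int) -> float:
    """Fraction of distinct ground-truth triplets found in the top-k predictions.

    Simplification (assumption): triplets are matched by label only,
    ignoring bounding-box localization.
    """
    unique_gt = set(ground_truth)
    if not unique_gt:
        return 0.0
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triplet in unique_gt if triplet in top_k)
    return hits / len(unique_gt)


# Toy predictions, already sorted by confidence (highest first).
predictions = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
    ("man", "holding", "surfboard"),
]
ground_truth = [
    ("man", "riding", "horse"),
    ("man", "holding", "surfboard"),
]

print(recall_at_k(predictions, ground_truth, k=2))  # 0.5
print(recall_at_k(predictions, ground_truth, k=4))  # 1.0
```

With K = 2 only one of the two annotated triplets is retrieved, so recall is 0.5; widening the cutoff to K = 4 recovers both.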

Abstract

Today's open vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Most existing methods adopt a two-stage pipeline: weakly supervised pre-training with image captions and supervised fine-tuning (SFT) on fully annotated scene graphs. Nonetheless, they omit explicit modeling of interacting objects and treat all objects equally, resulting in mismatched relation pairs. To this end, we propose an interaction-aware OVSGG framework INOVA. During pre-training, INOVA employs an interaction-aware target generation strategy to distinguish interacting objects from non-interacting ones. In SFT, INOVA devises an interaction-guided query selection tactic to prioritize interacting objects during bipartite graph matching. Besides, INOVA is equipped with an interaction-consistent knowledge distillation to enhance the robustness by pushing interacting object pairs away from the background. Extensive experiments on two benchmarks (VG and GQA) show that INOVA achieves state-of-the-art performance, demonstrating the potential of interaction-aware mechanisms for real-world applications.
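The mismatch described in the abstract can be illustrated with a toy example. The sketch below is not INOVA's implementation: the cost values, the per-query interaction scores, and the helpers `min_cost_matching` and `select_top_k` are all hypothetical, and the matching is a brute-force optimal assignment rather than the Hungarian algorithm usually used in practice. It only shows how filtering queries by an interaction signal before matching can change which "man" is assigned to the ground-truth rider.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def min_cost_matching(cost: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Brute-force optimal assignment for small problems.

    cost[g][q] is the cost of assigning ground-truth object g to query q.
    Returns the chosen query index for each ground truth and the total cost.
    Assumes there are at least as many queries as ground-truth objects.
    """
    num_gt, num_queries = len(cost), len(cost[0])
    best_assignment: List[int] = []
    best_total = float("inf")
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[g][q] for g, q in enumerate(queries))
        if total < best_total:
            best_assignment, best_total = list(queries), total
    return best_assignment, best_total


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring queries, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda q: scores[q], reverse=True)
    return sorted(ranked[:k])


# Hypothetical queries: 0 = bystander "man", 1 = rider "man", 2 = "horse".
# Ground truth for <man, riding, horse>: row 0 = rider "man", row 1 = "horse".
cost = [
    [0.30, 0.35, 0.95],  # the bystander looks slightly cheaper than the rider
    [0.90, 0.92, 0.20],
]
interaction_scores = [0.1, 0.9, 0.8]  # assumed per-query interaction signal

# Matching over all queries treats every object equally.
plain, _ = min_cost_matching(cost)

# Keep only the two most interaction-relevant queries, then match.
kept = select_top_k(interaction_scores, k=2)
sub_cost = [[row[q] for q in kept] for row in cost]
sub_assignment, _ = min_cost_matching(sub_cost)
guided = [kept[i] for i in sub_assignment]

print(plain)   # [0, 2]
print(guided)  # [1, 2]
```

Without any interaction signal, the ground-truth rider is matched to the bystander (query 0) because its cost is marginally lower; after the interaction-based filtering, the rider (query 1) is matched instead.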

Paper Structure

This paper contains 15 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of challenges in the OVSGG framework. 1) VLM pre-training: using solely entity categories for object detection causes ambiguity in associating object pairs (e.g., identifying the correct "man-surfboard" pair for the relation "hold"). 2) SFT: bipartite graph matching misaligns non-interacting objects (e.g., "man") with the interacting target "man" in $\langle$man, riding, horse$\rangle$.
  • Figure 2: Overview of INOVA for OVSGG. (a) VLM Pre-training: Interaction-aware target generation uses bidirectional interaction prompts and rule-based bounding box combinations to generate supervision, enriching object tokens with contextual interaction semantics. (b) SFT: A two-step interaction-guided query selection (IQS) prioritizes interacting objects and integrates relational context into object tokens, refining queries for the decoder. Bipartite graph matching aligns predictions with ground-truth for entity and relation classification.
  • Figure 3: Illustration of interaction-consistent KD.
  • Figure 4: Interaction-aware target generation.
  • Figure 5: Interaction-guided query selection.