Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation

Lin Li, Chuhan Zhang, Dong Zhang, Chong Sun, Chen Li, Long Chen

TL;DR

The paper addresses open-vocabulary scene graph generation by introducing an interaction-centric paradigm that models interactions explicitly during both knowledge infusion and knowledge transfer. It proposes bidirectional interaction prompts to generate robust pseudo-supervision, and a two-part transfer scheme (interaction-guided query selection and interaction-consistent knowledge distillation) to reduce mismatches and preserve relational semantics from a pre-trained teacher. Results on VG, GQA, and PSG show state-of-the-art performance, and ablations support each component's contribution. Overall, the work highlights the importance of explicitly modeling interactions for robust open-world scene understanding and relational reasoning.

Abstract

Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge of pre-trained large-scale models. Existing OVSGG methods typically adopt a two-stage pipeline: 1) infusing knowledge into large-scale models via pre-training on large datasets; 2) transferring knowledge from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, because they lack explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation causes critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and it causes ambiguous query matching during knowledge transfer. To this end, we propose an interACtion-Centric end-to-end OVSGG framework (ACC) that follows an interaction-driven paradigm to minimize these mismatches. For interaction-centric knowledge infusion, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model's interaction knowledge. For interaction-centric knowledge transfer, ACC first adopts interaction-guided query selection, which prioritizes pairing interacting objects to reduce interference from non-interacting ones. It then integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.
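
The abstract does not say how queries are scored or selected, so the following is only a rough sketch of the general idea behind interaction-guided query selection: keep the object queries most likely to take part in an interaction, and form subject-object pairs among those alone. The function name `select_interacting_pairs`, the per-query `interaction_scores`, and the `top_k` cutoff are all hypothetical and are not taken from the paper.

```python
from itertools import permutations


def select_interacting_pairs(interaction_scores, top_k):
    """Pair only the top_k queries ranked by a (hypothetical) interaction score.

    interaction_scores: one float per object query; higher means the query is
    assumed more likely to be involved in some relationship.
    Returns ordered (subject_index, object_index) pairs without self-pairs.
    """
    # Rank query indices by descending score; sorted() is stable, so ties
    # keep their original order.
    ranked = sorted(range(len(interaction_scores)),
                    key=lambda i: -interaction_scores[i])
    # Keep the top_k queries and restore index order for readable output.
    kept = sorted(ranked[:top_k])
    # Every ordered pair among the kept queries is a candidate relation.
    return list(permutations(kept, 2))


if __name__ == "__main__":
    # Assumed toy scores: queries 0 and 2 look interacting, 1 and 3 do not.
    scores = [0.9, 0.1, 0.8, 0.2]
    print(select_interacting_pairs(scores, top_k=2))
```

With four queries, exhaustive pairing would yield 12 ordered candidates; restricting to the two highest-scoring queries leaves 2, which is the kind of reduction in non-interacting candidates that the abstract describes.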

Paper Structure

This paper contains 30 sections, 12 equations, 11 figures, 13 tables, 2 algorithms.

Figures (11)

  • Figure 1: Overview of the challenges in end-to-end OVSGG frameworks. (a) Knowledge Infusion: using only object categories for detection causes ambiguity in associating object pairs (e.g., identifying the correct "man-surfboard" pair for "hold"). (b) Knowledge Transfer: the large number of object query candidates causes non-interacting objects (e.g., "man") to be mismatched with the interacting training target "man" in $\langle$man, riding, horse$\rangle$.
  • Figure 2: Overview of ACC for OVSGG. (a) Interaction-Centric Knowledge Infusion: Employs bidirectional interaction prompts and rule-based bounding box combinations for robust pseudo-supervision, empowering the model's grasp of relational knowledge. (b) Interaction-Centric Knowledge Transfer: Uses interaction-guided query selection to prioritize learning on interacting objects, and interaction-consistent KD transfers comprehensive relational insights from the pre-trained VLM to ensure robust generalization to novel categories.
  • Figure 3: Illustration of interaction-consistent KD.
  • Figure 4: Pseudo supervision generation in ACC.
  • Figure 5: Interaction-guided query selection.
  • ...and 6 more figures