Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation

Minghan Chen, Guikun Chen, Wenguan Wang, Yi Yang

TL;DR

Hydra-SGG tackles the slow convergence of DETR-based one-stage scene graph generation, which stems from sparse supervision and false negatives. It introduces Hybrid Relation Assignment, which combines One-to-One matching with a One-to-Many mechanism, and a training-time Hydra Branch (an auxiliary decoder without self-attention) that promotes duplicate relation predictions and enriches supervision signals. The approach yields substantial gains on VG150 and GQA (mean recall) and on Open Images V6 (weighted score), achieving state-of-the-art results with much faster convergence (e.g., 12 epochs on VG150) than prior DETR-based methods. By balancing supervision and architectural design, this work advances end-to-end SGG toward practical, efficient scene understanding, with potential for open-vocabulary extensions in future work.

Abstract

DETR introduces a simplified one-stage framework for scene graph generation (SGG) but faces challenges of sparse supervision and false negative samples. The former occurs because each image typically contains fewer than 10 relation annotations, while DETR-based SGG models employ over 100 relation queries. Each ground truth relation is assigned to only one query during training. The latter arises when one ground truth relation may have multiple queries with similar matching scores, leading to suboptimally matched queries being treated as negative samples. To address these, we propose Hydra-SGG, a one-stage SGG method featuring a Hybrid Relation Assignment. This approach combines a One-to-One Relation Assignment with an IoU-based One-to-Many Relation Assignment, increasing positive training samples and mitigating sparse supervision. In addition, we empirically demonstrate that removing self-attention between relation queries leads to duplicate predictions, which actually benefits the proposed One-to-Many Relation Assignment. With this insight, we introduce Hydra Branch, an auxiliary decoder without self-attention layers, to further enhance One-to-Many Relation Assignment by promoting different queries to make the same relation prediction. Hydra-SGG achieves state-of-the-art performance on multiple datasets, including VG150 (16.0 mR@50), Open Images V6 (50.1 weighted score), and GQA (12.7 mR@50).
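
The abstract's Hybrid Relation Assignment can be illustrated with a toy example. The sketch below is not the authors' implementation: the function names, the pair score (minimum of subject and object box IoU), the 0.7 threshold, and the brute-force search standing in for Hungarian matching are all assumptions made for illustration. It shows how an IoU-based One-to-Many rule can turn a near-duplicate query, which One-to-One matching would treat as a negative, into an extra positive sample.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_iou(pred, gt):
    """Assumed relation score: the smaller of the subject IoU and object IoU."""
    return min(box_iou(pred[0], gt[0]), box_iou(pred[1], gt[1]))


def one_to_one(preds, gts):
    """Give each GT relation exactly one distinct query.

    Brute-force stand-in for Hungarian matching; only viable for tiny inputs.
    """
    best, best_score = (), -1.0
    for perm in permutations(range(len(preds)), len(gts)):
        score = sum(pair_iou(preds[q], gts[g]) for g, q in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return {g: q for g, q in enumerate(best)}


def one_to_many(preds, gts, iou_thresh=0.7):
    """Give each GT relation every query whose pair IoU clears the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return {
        g: [q for q in range(len(preds)) if pair_iou(preds[q], gt) >= iou_thresh]
        for g, gt in enumerate(gts)
    }


# One GT relation as (subject_box, object_box), and three relation queries.
gts = [((0, 0, 10, 10), (20, 20, 30, 30))]
preds = [
    ((0, 0, 10, 10), (20, 20, 30, 30)),    # query 0: exact match
    ((0, 0, 10, 9), (20, 20, 30, 30)),     # query 1: near-duplicate, pair IoU 0.9
    ((50, 50, 60, 60), (70, 70, 80, 80)),  # query 2: unrelated
]

o2o = one_to_one(preds, gts)    # {0: 0}
o2m = one_to_many(preds, gts)   # {0: [0, 1]}
```

Under One-to-One matching only query 0 is positive, so query 1 is penalized despite being nearly correct (the false negative case). The One-to-Many rule keeps both, which is the sense in which hybrid assignment increases positive samples.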

Paper Structure

This paper contains 17 sections, 5 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison with other SGG methods in mR@50 and training epochs on VG150 [xu2017scene].
  • Figure 2: (a) Previous DETR-based SGG methods such as RelTR [cong2023reltr] and SGTR [li2022sgtr] match each GT relation with only one query. (b) Our Hybrid Relation Assignment utilizes both One-to-One and One-to-Many assignments, generating more positive samples and thus accelerating training.
  • Figure 3: Overall pipeline of Hydra-SGG. For simplicity, the FFNs inside the Transformer layers are omitted. Hydra-SGG incorporates two Transformer decoders: HydraBranch and RelDecoder. HydraBranch shares its parameters with RelDecoder but removes the self-attention layers. Hydra-SGG combines the One-to-One and One-to-Many assignments in synergy, generating more supervision signals.
  • Figure 4: (a) The average number of positive samples of VG150 train and val for One-to-One and Hybrid Relation Assignment. (b) The percentage increase in positive samples achieved by Hybrid Relation Assignment compared to the One-to-One baseline. (c) ADS on VG150 val. (d)-(e) The visualizations show that for the same group of queries that previously predicted different relations, removing the self-attention layers causes them to make identical predictions. The Q ID column represents the ID of each relation query.
  • Figure 5: Qualitative results. (a)-(d) compare Hydra-SGG and RelTR [cong2023reltr] on a VG150 [krishna2017visual] val image. We use the same color for each entity category, and the color of a predicate matches that of its subject. Differences are highlighted with red dashed rectangles. (e)-(g) show scene graphs generated by Hydra-SGG from images sourced from https://unsplash.com, a platform for freely usable images. These are real-world, "in the wild" images, demonstrating our model's capability to handle diverse and unseen visual content.
  • ...and 2 more figures
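
The Figure 3 caption describes HydraBranch as the RelDecoder with its self-attention layers removed and its parameters shared. A minimal sketch of that idea follows, under strong simplifying assumptions: a single toy layer, single-head dot-product attention with identity projections (so there are no learned weights, and "sharing" is modeled by reusing the same function), and a hypothetical `use_self_attn` flag. It is meant only to show one plausible reason duplicates appear: without self-attention, each query's output depends on that query and the image memory alone, so no query can react to what the others predict.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)])
    return out


def add(xs, ys):
    """Row-wise residual addition of two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def decoder_layer(queries, memory, use_self_attn):
    """Toy decoder layer: optional self-attention, then cross-attention.

    use_self_attn=True mimics a RelDecoder layer; False mimics HydraBranch.
    """
    x = queries
    if use_self_attn:
        x = add(x, attend(x, x))       # queries exchange information
    return add(x, attend(x, memory))   # queries read image features


memory = [[0.5, 0.5], [1.0, -1.0]]        # stand-in image features
queries_a = [[1.0, 0.0], [0.0, 1.0]]
queries_b = [[1.0, 0.0], [3.0, -2.0]]     # only query 1 differs from queries_a

# Without self-attention, query 0 cannot see that query 1 changed.
hydra_a = decoder_layer(queries_a, memory, use_self_attn=False)
hydra_b = decoder_layer(queries_b, memory, use_self_attn=False)

# With self-attention, query 0's output shifts when query 1 changes.
rel_a = decoder_layer(queries_a, memory, use_self_attn=True)
rel_b = decoder_layer(queries_b, memory, use_self_attn=True)
```

In this toy setting `hydra_a[0]` equals `hydra_b[0]` while `rel_a[0]` differs from `rel_b[0]`. A real decoder would add learned projections, multiple heads, normalization, and FFNs, none of which are modeled here.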