REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation

Maëlic Neau, Zoe Falomir

TL;DR

REACT++ is a new state-of-the-art model for real-time SGG. It combines efficient feature extraction with subject-to-object cross-attention within the prototype space to balance latency and representational power.

Abstract

Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we build on the powerful Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture and propose REACT++, a new state-of-the-art model for real-time SGG. By leveraging efficient feature extraction and subject-to-object cross-attention within the prototype space, REACT++ balances latency and representational power. REACT++ achieves the highest inference speed among existing SGG models, improving relation prediction accuracy without sacrificing object detection performance. Compared to the previous REACT version, REACT++ is 20% faster with a gain of 10% in relation prediction accuracy on average. The code is available at https://github.com/Maelic/SGG-Benchmark.

Paper Structure (30 sections, 8 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Comparison of the PE-NET architecture [zheng2023prototype] (top) with our REACT++ architecture (bottom) in Stage 1. The dashed boxes in orange [- - -] represent modified or added components; $\bigotimes$ denotes element-wise concatenation.
  • Figure 2: Our REACT++ architecture is based on REACT [Neau_2025_BMVC]. The dashed boxes in purple [- - -] represent components modified or added relative to REACT. The dashed boxes in orange [- - -] represent components modified or added relative to the original PE-NET (see Figure 1). $\bigotimes$ denotes element-wise concatenation.
  • Figure 3: Left: Latency comparison of the REACT++ model with the different Feature Extraction components. Right: Evolution of the F1@K metric across different stages of training, for the same components.
  • Figure 4: Left: Latency of the REACT++ model for different numbers of proposals per image, with batch size 1. Right: Average F1@K of the REACT++ model for different numbers of proposals.
  • Figure 5: Average Recall@K, meanRecall@K, and mAP@50 performance of REACT++ against the corresponding latency, using different variants of the YOLO12 model [tian2025yolov12].
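
The captions above report Recall@K, meanRecall@K, and F1@K without defining them. The sketch below illustrates how these metrics are commonly computed in the SGG literature. It assumes that F1@K is the harmonic mean of Recall@K and meanRecall@K, which this page does not state. The function names and toy triplets are hypothetical, and the sketch simplifies real evaluation by matching triplets on labels only (no bounding-box IoU check) and by using a single image rather than aggregating per-predicate recall over a dataset.

```python
from collections import defaultdict

def recall_at_k(gt, ranked_preds, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    top_k = set(ranked_preds[:k])
    return sum(t in top_k for t in gt) / len(gt)

def mean_recall_at_k(gt, ranked_preds, k):
    """Recall@K computed separately per predicate class, then averaged."""
    top_k = set(ranked_preds[:k])
    hits, totals = defaultdict(int), defaultdict(int)
    for subj, pred, obj in gt:
        totals[pred] += 1
        hits[pred] += (subj, pred, obj) in top_k
    return sum(hits[p] / totals[p] for p in totals) / len(totals)

def f1_at_k(recall, mean_recall):
    """Harmonic mean of Recall@K and meanRecall@K (assumed definition)."""
    if recall + mean_recall == 0:
        return 0.0
    return 2 * recall * mean_recall / (recall + mean_recall)

# Toy ground truth: (subject, predicate, object) triplets for one image.
gt = [
    ("man", "on", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
    ("man", "holding", "rope"),
]
# Toy predictions, sorted by decreasing confidence.
ranked_preds = [
    ("man", "on", "horse"),
    ("man", "on", "hat"),
    ("horse", "on", "grass"),
    ("man", "wearing", "hat"),
    ("man", "holding", "rope"),
]

r = recall_at_k(gt, ranked_preds, k=3)
mr = mean_recall_at_k(gt, ranked_preds, k=3)
f1 = f1_at_k(r, mr)
print(f"R@3={r:.3f}  mR@3={mr:.3f}  F1@3={f1:.3f}")
```

In this toy case the frequent predicate "on" is fully recovered while the rarer "wearing" and "holding" are missed, so meanRecall@3 falls below Recall@3. Under the assumed definition, F1@K rewards models that do well on both frequent and rare predicates.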