REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation

Maëlic Neau; Zoe Falomir

TL;DR

This paper proposes REACT++, a new state-of-the-art model for real-time SGG that combines efficient feature extraction with subject-to-object cross-attention within the prototype space to balance latency and representational power.

Abstract

Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we build on the powerful Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture and propose REACT++, a new state-of-the-art model for real-time SGG. By leveraging efficient feature extraction and subject-to-object cross-attention within the prototype space, REACT++ balances latency and representational power. REACT++ achieves the highest inference speed among existing SGG models, improving relation prediction accuracy without sacrificing object detection performance. Compared to the previous REACT version, REACT++ is 20% faster with a gain of 10% in relation prediction accuracy on average. The code is available at https://github.com/Maelic/SGG-Benchmark.
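To make the term "subject-to-object cross-attention" concrete, the following is a minimal sketch of generic single-head scaled dot-product cross-attention, in which a subject embedding acts as the query and candidate object embeddings act as keys and values. It is an illustration only and does not reproduce the REACT++ implementation: the function names, the toy vectors, and the absence of learned projections, rotary embeddings, and prototypes are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Generic scaled dot-product attention for a single query vector.

    query:  subject embedding (list of floats)
    keys:   object embeddings used to score each candidate object
    values: object embeddings that are mixed according to the scores
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return attended, weights


# Hypothetical toy embeddings, not taken from the paper.
subject = [1.0, 0.0]
objects = [[1.0, 0.0], [0.0, 1.0]]
attended, weights = cross_attention(subject, objects, objects)
```

In this toy setting the subject attends more strongly to the first object, whose embedding is aligned with it. A real model would apply learned projections to queries, keys, and values before scoring.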

Paper Structure (30 sections, 8 equations, 5 figures, 7 tables)

Introduction
Related Work
Scene Graph Generation
Real-Time SGG
Investigating Bottlenecks in SGG
Toward Real-Time SGG
Decoupled Two-Stage SGG
DAMP: Detection-Anchored Multi-scale Pooling
(1) Multi-scale gather with Gaussian neighbourhood.
(2) Projection and fusion.
CARPE: Cross-Attention Rotary Prototype Embedding
(1) Semantic lifting.
(2) Visual–semantic fusion.
(3) Geometry encoding.
(4) Prototype bank.
...and 15 more sections

Figures (5)

Figure 1: Comparison of the PE-NET architecture zheng2023prototype (top) with our REACT++ architecture (bottom) in Stage 1. The dashed boxes in orange [- - -] represent modified or added components; $\bigotimes$ denotes element-wise concatenation.
Figure 2: Our REACT++ architecture is based on REACT Neau_2025_BMVC. The dashed boxes in purple [- - -] represent modified or added components to REACT. The dashed boxes in orange [- - -] represent modified or added components to the original PE-NET (see Figure 1). $\bigotimes$ denotes element-wise concatenation.
Figure 3: Left: latency of the REACT++ model with different Feature Extraction components. Right: evolution of the F1@K metric across training stages for the same components.
Figure 4: Left: latency of the REACT++ model for different numbers of proposals per image, with batch size 1. Right: average F1@K of the REACT++ model for different numbers of proposals.
Figure 5: Average Recall@K, meanRecall@K, and mAP@50 performance for REACT++ against the corresponding latency using different variants of the YOLO12 model tian2025yolov12.
