S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR

Jialun Pei, Diandian Guo, Jingyang Zhang, Manxi Lin, Yueming Jin, Pheng-Ann Heng

TL;DR

This work tackles scene graph generation in operating rooms by eliminating multi-stage pipelines in favor of a single-stage, end-to-end, bi-modal transformer that fuses 2D multi-view imagery and 3D point clouds. The approach introduces a View-Sync Transfusion module for cross-view interaction, a Geometry-Visual Cohesion mechanism to integrate appearance and geometry, and a relation-sensitive transformer with dynamic relation queries to predict subject–object relations directly. Empirical results on the 4D-OR dataset show superior precision, recall, and macro F1 for key surgical relations, with substantially fewer parameters and faster inference than prior OR-SGG methods, and strong generalization to the 3DSSG benchmark. Clinically, the method improves downstream tasks such as clinical role prediction, demonstrating practical potential for real-time surgical intelligence and workflow optimization.

Abstract

Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR). However, previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes involving pose estimation and object detection. This pipeline may compromise the flexibility of learning multimodal representations, consequently constraining overall effectiveness. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S^2Former-OR, which aims to leverage multi-view 2D scenes and 3D point clouds in a complementary way for end-to-end SGG. Concretely, our model adopts a View-Sync Transfusion scheme to encourage multi-view visual information interaction. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate the synergic 2D semantic features into 3D point cloud features. Moreover, based on the augmented feature, we propose a novel relation-sensitive transformer decoder that embeds dynamic entity-pair queries and relational trait priors, enabling direct prediction of entity-pair relations for graph generation without intermediate steps. Extensive experiments validate the superior SGG performance and lower computational cost of S^2Former-OR on the 4D-OR benchmark compared with current OR-SGG methods, e.g., a 3-percentage-point increase in Precision and a 24.2M reduction in model parameters. We further compare our method with generic single-stage SGG methods using broader metrics for a comprehensive evaluation, and it consistently achieves better performance.
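
The reported gains are stated in terms of precision, recall, and macro F1 over relation classes. As a reading aid, the following minimal sketch shows how macro-averaged scores of this kind are commonly computed: each class is scored separately and the per-class scores are averaged with equal weight. The function name `macro_prf` and the toy relation labels are assumptions made for illustration; this is not the paper's evaluation code, and the benchmark's official protocol may differ in detail.

```python
from collections import Counter


def macro_prf(y_true, y_pred, labels=None):
    """Return macro-averaged (precision, recall, F1) over relation classes.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted class p, but it was wrong
            fn[t] += 1          # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    # Every class contributes equally, regardless of how often it occurs.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with made-up labels (one label per entity pair).
y_true = ["assisting", "assisting", "cutting", "none"]
y_pred = ["assisting", "cutting", "cutting", "none"]
precision, recall, f1 = macro_prf(y_true, y_pred)
```

Because every class carries the same weight, rare relations influence a macro score as much as frequent ones, which is presumably why it suits datasets with imbalanced relation classes.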

Paper Structure

This paper contains 33 sections, 10 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Diagrams of multimodal architectures for scene graph generation in operating rooms. (a) Existing multi-stage model [ozsoy20224d]; (b) our proposed single-stage model, trained end to end. Black arrows represent forward inference, while grey arrows represent backpropagation during the training phase.
  • Figure 2: Overview of the proposed single-stage multi-view bi-modal S$^2$Former-OR for scene graph generation in operating rooms. We first extract appearance and geometric features separately from 2D multi-view images and 3D point cloud inputs. In the multi-view unit, a View-Sync Transfusion (VST) module synthesizes multi-view semantic features. Then, we introduce Geometry-Visual Cohesion (GVC) to fuse 2D synergic features and 3D point cloud features into unified features. In the entity unit, we use the unified features and entity queries to predict entity proposals; in the relation unit, we generate dynamic relation queries by assembling latent entity pairs with relational trait priors, which are fed into our relation-sensitive transformer to generate scene graphs in the operating theatre.
  • Figure 3: Illustration of our relation-sensitive transformer decoder.
  • Figure 4: Qualitative results of the 4D-OR model [ozsoy20224d], LABRAD-OR [ozsoy2023labrad], and our S$^2$Former-OR on the 4D-OR validation set. Blue rectangles represent the human attribute and red ellipses denote the object attribute. Correctly and erroneously predicted relationships are shown in green and red, respectively.
  • Figure 5: Visualizations of attention maps in our relation-sensitive transformer on the 4D-OR validation set. The top row displays the main view weight map with the unified feature $F_{u}$; the bottom row displays the View#4 weight map with the ResNet feature $R_{4}$.
  • ...and 1 more figure
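
The Figure 2 caption describes relation queries that are assembled from latent entity pairs together with relational trait priors. The sketch below illustrates one plausible reading of that step under strong assumptions: each ordered (subject, object) pair is represented by concatenating the two entity embeddings, and a prior vector is added element-wise. The function `build_relation_queries`, the use of concatenation, and the additive prior are all hypothetical choices made for illustration; the paper's actual construction may differ.

```python
from itertools import permutations


def build_relation_queries(entity_embeddings, relation_prior):
    """Build one query vector per ordered (subject, object) entity pair.

    Assumptions (not from the paper): the pair representation is the
    concatenation [subject; object], and the prior is added element-wise.
    `relation_prior` must therefore have twice the embedding length.
    """
    dim = len(entity_embeddings[0])
    if len(relation_prior) != 2 * dim:
        raise ValueError("relation_prior must have length 2 * embedding dim")
    queries = {}
    # permutations(..., 2) yields ordered pairs, so (s, o) and (o, s) differ;
    # relations such as "assisting" are directional.
    for s, o in permutations(range(len(entity_embeddings)), 2):
        pair = entity_embeddings[s] + entity_embeddings[o]  # list concatenation
        queries[(s, o)] = [x + b for x, b in zip(pair, relation_prior)]
    return queries


# Toy example: three entities with 2-D embeddings (made-up values).
entities = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
prior = [0.5, -0.5, 0.25, 0.0]
queries = build_relation_queries(entities, prior)
```

With N entities this produces N(N-1) ordered pairs, so a decoder consuming such queries can score every candidate subject-object relation in one pass rather than relying on a separate pairing stage.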