TopoOR: A Unified Topological Scene Representation for the Operating Room

Tony Danjun Wang, Ka Young Kim, Tolga Birdal, Nassir Navab, Lennart Bastian

TL;DR

TopoOR is a new paradigm that models multimodal operating rooms as a higher-order structure, innately preserving pairwise and group relationships. It also proposes a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention.

Abstract

Surgical scene graphs abstract the complexity of surgical operating rooms (ORs) into a structure of entities and their relations, but existing paradigms are limited to strictly dyadic structures. Frameworks that rely predominantly on pairwise message passing or tokenized sequences flatten the manifold geometry inherent to relational structures, losing structure in the process. We introduce TopoOR, a new paradigm that models multimodal operating rooms as a higher-order structure, innately preserving pairwise and group relationships. By lifting interactions between entities into higher-order topological cells, TopoOR natively models the complex dynamics and multimodality present in the OR. This topological representation subsumes traditional scene graphs, thereby offering strictly greater expressivity. We also propose a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention. In this way, unlike existing methods, we avoid combining 3D geometry, audio, and robot kinematics into a single joint latent representation, preserving the precise multimodal structure required for safety-critical reasoning. Extensive experiments demonstrate that our approach outperforms traditional graph and LLM-based baselines across sterility breach detection, robot phase prediction, and next-action anticipation.

Paper Structure

This paper contains 8 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: We model surgical ORs as higher-order structures. Our framework explicitly instantiates cells anchored in 3D for physical entities—such as the nurse and surgical robot—as well as data sources from diverse modalities, including audio. Combining these elements into a higher-order structure enables modeling of the unified interaction among the head surgeon, robot, saw, and patient (bottom left), while preserving the complex, multi-actor dynamics of surgical procedures.
  • Figure 2: Overview of TopoOR. Given multimodal sensory inputs over a temporal window (I), 3D entities and evidence features are initialized (II) and abstracted into a combinatorial complex (CC) $\mathcal{X}$ (III). Higher-order attention is computed across incidence neighborhoods $x \in \mathcal{N}(y)$, using a learnable rank-bias to preserve structural heterogeneity (IV). Pooled representations are routed to downstream tasks, enabling, e.g., simultaneous next-action anticipation and robot-phase prediction (V).
  • Figure 3: Qualitative results and scene abstraction. We demonstrate improved performance in (I) robot phase prediction over baseline models. While all methods operate on the same (II) explicit 3D input entities, (III) illustrates the distinct structural formulation of each approach. Unlike the flattened relations of standard networks, our higher-order representation explicitly models the hierarchical incidence of the surgical environment through rank-0, rank-1, and rank-2 cells.
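
Step (IV) of Figure 2 describes attention over an incidence neighborhood with a learnable rank-bias. The following is only a minimal sketch of that idea, not the paper's layer: it assumes identity query/key/value projections, scaled dot-product scores, and a bias table indexed by (source rank, target rank). The function name `attend`, the `rank_bias` dictionary, and the toy features are all hypothetical.

```python
import math


def attend(target, neighbors, feats, ranks, rank_bias):
    """Aggregate neighbor features into `target` with rank-biased softmax attention.

    Assumptions (not from the paper): no learned projections, and an additive
    bias looked up by (rank of source cell, rank of target cell).
    """
    if not neighbors:
        # A cell with no incident cells keeps its own features.
        return list(feats[target]), {}
    d = len(feats[target])
    scores = {}
    for x in neighbors:
        dot = sum(a * b for a, b in zip(feats[target], feats[x]))
        scores[x] = dot / math.sqrt(d) + rank_bias[(ranks[x], ranks[target])]
    # Numerically stable softmax restricted to the neighborhood.
    top = max(scores.values())
    exps = {x: math.exp(s - top) for x, s in scores.items()}
    total = sum(exps.values())
    weights = {x: e / total for x, e in exps.items()}
    out = [sum(weights[x] * feats[x][i] for x in neighbors) for i in range(d)]
    return out, weights


# Toy cells: two rank-0 entities, one rank-1 relation, one rank-2 group.
feats = {
    "surgeon": [1.0, 0.0],
    "saw": [0.0, 1.0],
    "surgeon-saw": [1.0, 1.0],
    "group": [1.0, 1.0],
}
ranks = {"surgeon": 0, "saw": 0, "surgeon-saw": 1, "group": 2}
rank_bias = {(0, 2): 0.0, (1, 2): 0.5}  # hypothetical values; learned in practice

out, weights = attend("group", ["surgeon", "saw", "surgeon-saw"], feats, ranks, rank_bias)
```

Keying the bias on the rank pair lets messages from rank-0 and rank-1 cells be weighted differently even when their features are similar, which is one plausible reading of "preserving structural heterogeneity".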

Theorems & Definitions (3)

  • Definition 1: Combinatorial Complex (Hajij et al., 2023)
  • Definition 2: Incidence Neighborhoods
  • Definition 3: Higher-Order Attention Layer
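
The first two definitions can be illustrated with a small data structure. This is a sketch under stated assumptions, not the paper's implementation: a combinatorial complex is taken to be a collection of non-empty cells (subsets of a vertex set) with a rank function that is order-preserving under inclusion, and the incidence neighborhood of a cell is assumed here to be the cells strictly contained in it at lower rank together with the cells strictly containing it at higher rank. The class name, method names, and the relation cells `holds` and `cuts` are hypothetical; the entities come from the Figure 1 caption.

```python
from dataclasses import dataclass, field


@dataclass
class CombinatorialComplex:
    """Cells are frozensets of vertex names, each mapped to a non-negative rank."""

    ranks: dict = field(default_factory=dict)

    def add_cell(self, members, rank):
        cell = frozenset(members)
        if not cell:
            raise ValueError("cells must be non-empty")
        if rank < 0:
            raise ValueError("rank must be non-negative")
        # Rank must be order-preserving: x subset of y implies rk(x) <= rk(y).
        for other, other_rank in self.ranks.items():
            if other < cell and other_rank > rank:
                raise ValueError("a contained cell has a higher rank")
            if cell < other and rank > other_rank:
                raise ValueError("a containing cell has a lower rank")
        # Every vertex is itself a cell; rank 0 is the convention assumed here.
        for v in cell:
            self.ranks.setdefault(frozenset([v]), 0)
        self.ranks[cell] = rank
        return cell

    def incidence_neighborhood(self, cell):
        """Return (down, up): lower-rank cells inside `cell`, higher-rank cells around it."""
        cell = frozenset(cell)
        rk = self.ranks[cell]
        down = {x for x, r in self.ranks.items() if x < cell and r < rk}
        up = {x for x, r in self.ranks.items() if cell < x and r > rk}
        return down, up


cc = CombinatorialComplex()
holds = cc.add_cell({"head_surgeon", "saw"}, rank=1)
cuts = cc.add_cell({"saw", "patient"}, rank=1)
group = cc.add_cell({"head_surgeon", "robot", "saw", "patient"}, rank=2)

group_down, group_up = cc.incidence_neighborhood(group)
saw_down, saw_up = cc.incidence_neighborhood({"saw"})
```

In this encoding, a conventional scene graph is the special case in which no cell has rank above 1, which is consistent with the abstract's statement that the topological representation subsumes scene graphs.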