ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling

Ege Özsoy, Chantal Pellegrini, Matthias Keicher, Nassir Navab

TL;DR

ORacle addresses the generalization gap in operating-room domain modeling by leveraging Large Vision-Language Models to generate semantic scene graphs from multiview RGB data. It introduces a multiview image pooler and knowledge integration that incorporates temporal context, textual and visual descriptors, and a symbolic representation to support open-vocabulary adaptation. An automatic data augmentation pipeline provides substantial variability, and a digitally altered adaptability benchmark tests robustness. On the 4D-OR dataset, ORacle achieves state-of-the-art scene graph generation with less data than prior methods and demonstrates robust adaptation to unseen views, actions, and tool appearances. This work significantly lowers hardware and data requirements while enabling scalable surgical data science.

Abstract

Every day, countless surgeries are performed worldwide, each within the distinct settings of operating rooms (ORs) that vary not only in their setups but also in the personnel, tools, and equipment used. This inherent diversity poses a substantial challenge for achieving a holistic understanding of the OR, as it requires models to generalize beyond their initial training datasets. To reduce this gap, we introduce ORacle, an advanced vision-language model designed for holistic OR domain modeling, which incorporates multi-view and temporal capabilities and can leverage external knowledge during inference, enabling it to adapt to previously unseen surgical scenarios. This capability is further enhanced by our novel data augmentation framework, which significantly diversifies the training dataset, ensuring ORacle's proficiency in applying the provided knowledge effectively. In rigorous testing on scene graph generation and downstream tasks on the 4D-OR dataset, ORacle not only demonstrates state-of-the-art performance but does so while requiring less data than existing models. Furthermore, its adaptability is displayed through its ability to interpret unseen views, actions, and appearances of tools and equipment. This demonstrates ORacle's potential to significantly enhance the scalability and affordability of OR domain modeling and opens a pathway for future advancements in surgical data science. We will release our code and data upon acceptance.

Paper Structure (9 sections, 3 figures, 7 tables)

Figures (3)

  • Figure 1: An overview of our end-to-end scene graph generation architecture. ORacle takes as input multiview images and optionally additional knowledge and directly generates a scene graph token by token, considering all information at once.
  • Figure 2: An overview of our automatic variability enhancement pipeline, used during training. It first samples a set of attributes, then generates a matching object. Thereafter, it samples a suitable scene from 4D-OR and correctly places the generated object into it. For examples of more realistic scenes used during evaluation, see the qualitative results figure.
  • Figure 3: Results on our adaptability benchmark of the non-adaptable ORacle-MV model (nA) and our adaptable models (A). Left: ORacle-adapt-text; Right: ORacle-adapt-vis.
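
The four steps named in the Figure 2 caption (sample attributes, generate a matching object, sample a scene, place the object) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the attribute vocabulary, the class and function names, and the random placement rule are all hypothetical, and the object generator is reduced to a plain record.

```python
import random
from dataclasses import dataclass

# Hypothetical attribute vocabulary; the paper's actual attribute set is not given here.
ATTRIBUTES = {
    "category": ["drill", "hammer", "saw"],
    "color": ["red", "green", "blue"],
    "size": ["small", "large"],
}


@dataclass
class SyntheticObject:
    """Stand-in for a generated object; a real pipeline would produce an image."""
    category: str
    color: str
    size: str


@dataclass
class Placement:
    """A sampled scene together with the object and where it was put."""
    scene_id: str
    obj: SyntheticObject
    position: tuple  # (x, y) in normalized image coordinates, assumed convention


def sample_attributes(rng):
    # Step 1: draw one value per attribute.
    return {name: rng.choice(values) for name, values in ATTRIBUTES.items()}


def generate_object(attrs):
    # Step 2: build an object matching the sampled attributes (placeholder only).
    return SyntheticObject(**attrs)


def sample_scene(rng, scene_ids):
    # Step 3: pick a scene to augment.
    return rng.choice(scene_ids)


def place_object(rng, scene_id, obj):
    # Step 4: choose a position. Assumption: uniform random placement; the caption
    # says objects are placed "correctly", which this sketch does not model.
    return Placement(scene_id, obj, (rng.random(), rng.random()))


def augment(rng, scene_ids):
    """Run the four steps once and return the resulting placement."""
    attrs = sample_attributes(rng)
    obj = generate_object(attrs)
    scene_id = sample_scene(rng, scene_ids)
    return place_object(rng, scene_id, obj)
```

Calling augment(random.Random(0), ["scene_a", "scene_b"]) returns one Placement; passing a seeded random.Random keeps the sampled variations reproducible across training runs.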