
Syn-Mediverse: A Multimodal Synthetic Dataset for Intelligent Scene Understanding of Healthcare Facilities

Rohit Mohan, José Arce, Sassan Mokhtar, Daniele Cattaneo, Abhinav Valada

TL;DR

Syn-Mediverse addresses the lack of public datasets for healthcare facility scene understanding by providing a hyper-realistic multimodal synthetic dataset generated in NVIDIA Isaac Sim, featuring over 48,000 RGB-D images and more than 1.5 million annotations across five perception tasks. The authors establish a benchmarking protocol covering these tasks and evaluate a spectrum of baselines, from classic to state-of-the-art models, to show how difficult the dataset is and how well models generalize across healthcare scenarios. Cross-domain experiments give qualitative evidence of transfer to real-world data, and an online benchmark is provided to accelerate research in medical robotics and facility management. Overall, Syn-Mediverse offers a scalable, richly annotated resource that enables rigorous evaluation while highlighting open challenges in real-world generalization and domain transfer.

Abstract

Safety and efficiency are paramount in healthcare facilities where the lives of patients are at stake. Despite the adoption of robots to assist medical staff in challenging tasks such as complex surgeries, human expertise is still indispensable. The next generation of autonomous healthcare robots hinges on their capacity to perceive and understand their complex and frenetic environments. While deep learning models are increasingly used for this purpose, they require extensive annotated training data, which is impractical to obtain in real-world healthcare settings. To bridge this gap, we present Syn-Mediverse, the first hyper-realistic multimodal synthetic dataset of diverse healthcare facilities. Syn-Mediverse contains over 48,000 images from a simulated industry-standard optical tracking camera and provides more than 1.5M annotations spanning five different scene understanding tasks: depth estimation, object detection, semantic segmentation, instance segmentation, and panoptic segmentation. We demonstrate the complexity of our dataset by evaluating the performance of a broad range of state-of-the-art baselines for each task. To further advance research on scene understanding of healthcare facilities, along with the public dataset we provide an online evaluation benchmark available at http://syn-mediverse.cs.uni-freiburg.de.

Paper Structure (21 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Overview of the Syn-Mediverse dataset consisting of multi-view images captured from a simulated industry-standard optical tracking camera. We provide pixel-level annotations for depth estimation, object detection, semantic segmentation, instance segmentation, and panoptic segmentation.
  • Figure 2: Depiction of the scene captured with the multi-camera setup in (a), and illustration of a synthetic image pre- and post-noise addition via histogram matching in (b) and (c).
  • Figure 3: Depiction of the diverse and complex environments in the Syn-Mediverse dataset. The images also show the ground truth labels for various tasks overlaid on the image. These images underline the comprehensive range of object types such as medical staff and equipment present in Syn-Mediverse, in addition to illustrating a variety of realistic scenarios through the interplay of lighting conditions and the interaction of elements within each scene. (8× zoom recommended.)
  • Figure 4: Statistical overview of the Syn-Mediverse dataset. In (a), the distribution of object segments across semantic classes is shown, with blue bars for stuff classes and red bars for thing classes with instances. In (b), the distribution of images according to instance count is depicted, reflecting the complexity of the scenes in the dataset. Next, (c) shows the number of semantic classes present in each image, indicating the richness of class diversity. Lastly, (d) displays the depth-wise pixel distribution, underlining the depth variability in our dataset.
  • Figure 5: Qualitative evaluation of leveraging synthetic knowledge on real-world semantic segmentation within medical environments, featuring images from the MVOR and 4D-OR datasets.
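
The caption of Figure 2 mentions noise addition via histogram matching but gives no further detail in this excerpt. As a rough illustration of the general technique only, and not the authors' pipeline, the following minimal sketch remaps the intensities of a single-channel image so that its empirical cumulative distribution follows that of a reference image. The function name `match_histogram` and the single-channel restriction are assumptions made for this example; an RGB image would need the same step applied per channel.

```python
import numpy as np


def match_histogram(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Remap source intensities so their empirical CDF follows the reference's.

    Hypothetical helper for illustration; works on single-channel arrays.
    """
    # Unique intensity values, where each pixel falls among them, and counts.
    src_values, src_inverse, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True
    )
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)

    # Empirical cumulative distribution functions of both images.
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size

    # For each source quantile, look up the reference intensity at that quantile.
    mapped = np.interp(src_cdf, ref_cdf, ref_values)

    # Scatter the mapped values back to the original pixel positions.
    return mapped[src_inverse].reshape(source.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    synthetic = rng.integers(0, 100, size=(4, 4)).astype(np.float64)
    real_like = rng.integers(100, 200, size=(4, 4)).astype(np.float64)
    matched = match_histogram(synthetic, real_like)
    print(matched.min(), matched.max())
```

The mapping is monotonic, so the relative ordering of pixel intensities in the source image is preserved while its overall intensity distribution moves toward the reference.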