
Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion

Nikolo Rohrmoser, Ghazal Ghazaei, Michael Sommersperger, Nassir Navab

Abstract

Purpose: The integration of multimodal imaging into the operating room paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, two complementary imaging modalities are now available: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery. Methods: We propose a multimodal, temporal, real-time-capable network architecture that performs joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS backbone and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence. Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that incorporating iOCT significantly improves tool-tissue distance estimation, while achieving real-time processing at 22.5 ms per frame. For close distances to the retina in particular (below 1 mm), the distance estimation error improved from 284 $\mu m$ (OPMI only) to 33 $\mu m$ (multimodal). Conclusion: Feature fusion of multimodal imaging can enhance multi-task prediction accuracy compared to single-modality processing, and real-time performance can be achieved through tailored network design. While our results demonstrate the potential of multimodal processing for image-guided vitreoretinal surgery, they also underline key challenges that motivate future research toward more reliable, consistent, and comprehensive surgical scene understanding.
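To make the described architecture concrete, the following is a minimal, hypothetical PyTorch sketch of the pipeline from the abstract and Figure 2: a placeholder convolutional stack standing in for the YoloNAS OPMI backbone, a modified ResNet-18 for grayscale iOCT B-scans, a simple fusion step, a convolutional GRU standing in for the region-based recurrent module, and task-specific heads for detection, keypoints, and distance. All module names, channel widths, and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the described pipeline; names and shapes are
# assumptions, not the paper's code.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class IOCTEncoder(nn.Module):
    """Stand-in for the paper's modified ResNet-18 over iOCT B-scans."""
    def __init__(self, out_ch=256):
        super().__init__()
        backbone = resnet18(weights=None)
        # Assumption: "modified" means a 1-channel first conv for grayscale scans.
        backbone.conv1 = nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.proj = nn.Conv2d(512, out_ch, 1)

    def forward(self, x):                      # x: (B, 1, H, W) B-scan
        return self.proj(self.features(x))

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU, a stand-in for the region-based recurrent module."""
    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, x, h):
        if h is None:
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * n + z * h             # updated hidden state

class MultiHeadTracker(nn.Module):
    def __init__(self, ch=256, n_kpts=2):
        super().__init__()
        # Placeholder conv stack standing in for the YoloNAS OPMI backbone.
        self.opmi_enc = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, ch, 3, stride=2, padding=1), nn.ReLU())
        self.ioct_enc = IOCTEncoder(ch)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)   # simplistic fusion; see the
                                               # cross-attention sketch below
        self.temporal = ConvGRUCell(ch)
        self.det_head = nn.Conv2d(ch, 5, 1)    # box (4) + objectness (1)
        self.kpt_head = nn.Conv2d(ch, n_kpts, 1)  # keypoint heatmaps
        self.dist_head = nn.Conv2d(ch, 2, 1)   # distance + certainty

    def forward(self, opmi, ioct, h=None):
        f_o = self.opmi_enc(opmi)
        f_i = self.ioct_enc(ioct)
        f_i = nn.functional.interpolate(f_i, size=f_o.shape[-2:])
        f = self.fuse(torch.cat([f_o, f_i], 1))
        h = self.temporal(f, h)                # carry h across frames for temporal coherence
        return self.det_head(h), self.kpt_head(h), self.dist_head(h), h
```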


Paper Structure

This paper contains 8 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Multimodal data in ophthalmic surgery consists of OPMI (left) and two perpendicularly aligned iOCT B-scan images (middle, right). The blue and magenta lines in the OPMI image indicate the locations of the B-scans.
  • Figure 2: The overall architecture. A Yolo-NAS backbone processes the OPMI stream, while a modified ResNet-18 extracts features from iOCT scans. An attention-based fusion module integrates these modalities before they are further refined by a recurrent module for temporal awareness. Finally, task-specific heads generate outputs including detection, keypoint estimation, and distance prediction.
  • Figure 3: Cross-attention fusion. OPMI pixel features (queries) attend to iOCT column descriptors (keys/values) enriched with positional encodings to inject spatially aware depth cues into the fused representation. The output retains the spatial resolution and channel dimensionality required by the downstream prediction heads; see the code sketch after this list.
  • Figure 4: Distance error distributions for a single peeling sequence in the test set, stratified by certainty, for the single-modality (SM) model utilizing only OPMI. Each boxplot summarizes the error distribution for one certainty category: certain (>90%), moderate (50-90%), or uncertain (<50%).
  • Figure 5: Qualitative results for distance estimation on a peeling sequence using the SM (a) and multimodal (MM) (b) models. Each prediction's (point's) certainty is encoded by color and error bar, where smaller and darker indicate higher certainty; a dark dot represents certainty >90%.
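
The Figure 3 caption describes the fusion mechanism in enough detail for a rough sketch. Below is a minimal, hypothetical PyTorch version, assuming invented tensor shapes and a learned positional encoding: flattened OPMI pixel features serve as queries, iOCT features pooled over the depth axis provide one descriptor per B-scan column as keys/values, and the output is reshaped back to the spatial layout expected by the prediction heads.

```python
# Hypothetical cross-attention fusion in the spirit of Figure 3; shapes,
# pooling, and the residual connection are assumptions for illustration.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, ch=256, n_cols=64, heads=8):
        super().__init__()
        # Learned positional encoding marking each B-scan column's location.
        self.pos = nn.Parameter(torch.randn(1, n_cols, ch) * 0.02)
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)

    def forward(self, opmi_feat, ioct_feat):
        # opmi_feat: (B, C, H, W) pixel features; ioct_feat: (B, C, D, n_cols)
        B, C, H, W = opmi_feat.shape
        q = opmi_feat.flatten(2).transpose(1, 2)    # (B, H*W, C) queries
        kv = ioct_feat.mean(dim=2).transpose(1, 2)  # pool depth -> (B, n_cols, C)
        kv = kv + self.pos                          # spatially aware depth cues
        fused, _ = self.attn(q, kv, kv)             # OPMI queries attend to iOCT columns
        fused = self.norm(fused + q)                # residual keeps OPMI detail
        # Restore the resolution and channels required by the prediction heads.
        return fused.transpose(1, 2).reshape(B, C, H, W)
```

A module like this could replace the simplistic concatenation fusion in the pipeline sketch above; the depth pooling and residual connection are design assumptions, chosen so that OPMI detail is preserved even where the iOCT cues are weak.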