A Framework for Multimodal Medical Image Interaction

Laura Schütz, Sasan Matinfar, Gideon Schafroth, Navid Navab, Merle Fairhurst, Arthur Wagner, Benedikt Wiestler, Ulrich Eck, Nassir Navab

TL;DR

The paper addresses the limited expressive capacity of unimodal medical image representations by introducing a physically informed Multimodal Medical Image Interaction (MMII) framework that delivers real-time audiovisual feedback in a VR setting. It combines anatomy-driven geometry and tissue properties with model-based sonification (via Modalys) to produce perceptually meaningful audio alongside dynamic visuals, guided by a causality-informed design. Through two studies, it demonstrates strong learnability of audiovisual mappings ($p<0.001$) and improved brain tumor localization accuracy ($p<0.05$) compared with conventional unimodal interaction, suggesting practical benefits for surgical navigation and diagnosis. The findings indicate MMII can reduce cognitive load and enhance spatial perception during complex medical tasks, with potential to augment intraoperative feedback and multimodal decision support.

Abstract

Medical doctors rely on images of the human anatomy, such as magnetic resonance imaging (MRI), to localize regions of interest in the patient during diagnosis and treatment. Despite advances in medical imaging technology, the information conveyance remains unimodal. This visual representation fails to capture the complexity of the real, multisensory interaction with human tissue. However, perceiving multimodal information about the patient's anatomy and disease in real-time is critical for the success of medical procedures and patient outcome. We introduce a Multimodal Medical Image Interaction (MMII) framework to allow medical experts a dynamic, audiovisual interaction with human tissue in three-dimensional space. In a virtual reality environment, the user receives physically informed audiovisual feedback to improve the spatial perception of anatomical structures. MMII uses a model-based sonification approach to generate sounds derived from the geometry and physical properties of tissue, thereby eliminating the need for hand-crafted sound design. Two user studies involving 34 general and nine clinical experts were conducted to evaluate the proposed interaction framework's learnability, usability, and accuracy. Our results showed excellent learnability of audiovisual correspondence as the rate of correct associations significantly improved (p < 0.001) over the course of the study. MMII resulted in superior brain tumor localization accuracy (p < 0.05) compared to conventional medical image interaction. Our findings substantiate the potential of this novel framework to enhance interaction with medical images, for example, during surgical procedures where immediate and precise feedback is needed.

Paper Structure (32 sections, 1 equation, 6 figures, 4 tables)

This paper contains 32 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Schema of the tissues used in Study 1 along with their physical properties of stiffness, density, and Poisson's ratio, as well as their resulting mel spectrograms.
  • Figure 2: Two question types in Study 1: Audio to Visual (A2V) and Visual to Audio (V2A). In A2V a sound was played and its visual correspondence had to be selected; in V2A a visual was shown and its audio correspondence had to be selected from a choice of three audio files. Mel spectrograms are used to symbolize the audio files in this figure.
  • Figure 3: NASA-TLX task load ratings for Visual to Audio (V2A) and Audio to Visual (A2V) question types in Study 1.
  • Figure 4: NASA-TLX task load ratings grouped by the single-structure and multiple-structures stages in Study 1.
  • Figure 5: Left: Scene view inside the head-mounted display of the three medical slices and the brain during the visual and audiovisual (MMII) conditions in Study 2. Right: Illustration of the sound sphere, the audible range around the point of interaction at the end of the controller ray. The distance from the interaction point to the anatomical structures defines the amplitude of the structures' sound.
  • ...and 1 more figure
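
The sound sphere in the Figure 5 caption maps the distance between the interaction point and each anatomical structure to that structure's amplitude. The sketch below illustrates one way such a mapping could look. The linear falloff, the function name `sphere_gain`, and the `radius` parameter are assumptions made for illustration; the text above does not specify the falloff curve the authors use.

```python
import math


def sphere_gain(interaction_point, structure_point, radius):
    """Return a gain in [0, 1] for one structure.

    Assumed mapping (not taken from the paper): gain falls off linearly
    from 1.0 at the interaction point to 0.0 at the edge of the sound
    sphere, and stays 0.0 outside it.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    # Euclidean distance between the controller-ray tip and the structure.
    distance = math.dist(interaction_point, structure_point)
    return max(0.0, 1.0 - distance / radius)


def mix_gains(interaction_point, structures, radius):
    """Compute a gain per named structure; structures maps name -> 3D point."""
    return {
        name: sphere_gain(interaction_point, point, radius)
        for name, point in structures.items()
    }
```

Under these assumptions, a structure touching the interaction point plays at full amplitude, one halfway to the sphere boundary plays at half amplitude, and anything outside the sphere is silent. A different curve, such as inverse-square attenuation, could be substituted in `sphere_gain` without changing the rest of the sketch.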