A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning

Liuyi Jin, Pasan Gunawardena, Amran Haroon, Runzhi Wang, Sangwoo Lee, Radu Stoleru, Michael Middleton, Zepeng Huo, Jeeeun Kim, Jason Moats

TL;DR

EMSGlass introduces EMSNet, the first multimodal multitask EMS model, and EMSServe, a low-latency, edge-aware serving framework that handles asynchronous modality arrival. EMSNet fuses text, vitals, and scene images into a unified representation $F_C \in \mathbb{R}^{|F_T|+|F_V|+|F_I|}$ to support five EMS tasks, while PMI enables effective learning with highly imbalanced modality data. EMSServe uses a modality-aware splitter, offline inference time profiling, and adaptive offloading with a feature cache to achieve 1.9x–11.7x speedups over direct PyTorch execution. A user study with six EMTs demonstrates improved real-time situational awareness and faster decision-making, advancing practical AI-enabled EMS workflows. The work provides open-source data, code, and models to foster future development of AI-enabled EMS systems that bridge multimodal intelligence with real-world emergency response workflows.
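The dimension of $F_C$ above, $|F_T|+|F_V|+|F_I|$, is what one would get from concatenating the three per-modality feature vectors. The following is a minimal sketch under that assumption, not EMSNet's actual fusion code; the function name `fuse_features` and the toy feature values are hypothetical.

```python
from typing import List, Sequence


def fuse_features(
    f_text: Sequence[float],
    f_vitals: Sequence[float],
    f_image: Sequence[float],
) -> List[float]:
    """Concatenate per-modality features into one vector F_C.

    Assumption: fusion is plain concatenation, which matches the stated
    dimension |F_T| + |F_V| + |F_I| but is not confirmed by the text.
    """
    return [*f_text, *f_vitals, *f_image]


# Toy feature vectors (hypothetical sizes and values).
f_t = [0.1, 0.2, 0.3, 0.4]  # text features, |F_T| = 4
f_v = [0.5, 0.6]            # vitals features, |F_V| = 2
f_i = [0.7, 0.8, 0.9]       # scene-image features, |F_I| = 3

f_c = fuse_features(f_t, f_v, f_i)
print(len(f_c))  # length equals |F_T| + |F_V| + |F_I|
```

With concatenation, each modality occupies a fixed slice of $F_C$, so downstream task heads can share one input layout regardless of which encoder produced which slice.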

Abstract

Emergency Medical Technicians (EMTs) operate in high-pressure environments, making rapid, life-critical decisions under heavy cognitive and operational loads. We present EMSGlass, a smart-glasses system powered by EMSNet, the first multimodal multitask model for Emergency Medical Services (EMS), and EMSServe, a low-latency multimodal serving framework tailored to EMS scenarios. EMSNet integrates text, vital signs, and scene images to construct a unified real-time understanding of EMS incidents. Trained on real-world multimodal EMS datasets, EMSNet simultaneously supports up to five critical EMS tasks with higher accuracy than state-of-the-art unimodal baselines. Built on top of PyTorch, EMSServe introduces a modality-aware model splitter and a feature caching mechanism, achieving adaptive and efficient inference across heterogeneous hardware while addressing the challenge of asynchronous modality arrival in the field. By optimizing multimodal inference execution in EMS scenarios, EMSServe achieves a 1.9x–11.7x speedup over direct PyTorch multimodal inference. A user study with six professional EMTs demonstrates that EMSGlass enhances real-time situational awareness, decision-making speed, and operational efficiency through intuitive on-glass interaction. In addition, qualitative insights from the user study provide actionable directions for extending EMSGlass toward next-generation AI-enabled EMS systems, bridging multimodal intelligence with real-world emergency response workflows.
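To make the idea of feature caching under asynchronous modality arrival concrete, here is a small sketch. It is an illustration only, not EMSServe's implementation: the class `FeatureCache`, the toy encoders, and the choice to zero-fill modalities that have not arrived yet are all assumptions introduced here. The point it shows is that when a new modality arrives, only that modality's encoder runs, and previously computed features are reused.

```python
from typing import Callable, Dict, List

# Fixed modality order for the fused vector (assumed for this sketch).
ORDER = ("text", "vitals", "image")


def encode_text(s: str) -> List[float]:
    # Toy stand-in for a text encoder: word count and character count.
    return [float(len(s.split())), float(len(s))]


def encode_vitals(v: List[float]) -> List[float]:
    # Toy stand-in for a vitals encoder: scale three readings.
    return [x / 100.0 for x in v]


def encode_image(pixels: List[int]) -> List[float]:
    # Toy stand-in for an image encoder: mean normalized intensity.
    return [sum(pixels) / (255.0 * len(pixels))]


class FeatureCache:
    """Keeps the latest encoded feature per modality (hypothetical helper)."""

    def __init__(self, encoders: Dict[str, Callable], dims: Dict[str, int]):
        self.encoders = encoders
        self.dims = dims
        self.cache: Dict[str, List[float]] = {}
        self.encode_calls = 0

    def update(self, modality: str, raw) -> None:
        # Encode only the modality that just arrived.
        self.cache[modality] = self.encoders[modality](raw)
        self.encode_calls += 1

    def fused(self) -> List[float]:
        # Reuse cached features; zero-fill missing modalities (assumption).
        out: List[float] = []
        for m in ORDER:
            out.extend(self.cache.get(m, [0.0] * self.dims[m]))
        return out


cache = FeatureCache(
    encoders={"text": encode_text, "vitals": encode_vitals, "image": encode_image},
    dims={"text": 2, "vitals": 3, "image": 1},
)

cache.update("text", "patient short of breath")  # text arrives first
first = cache.fused()

cache.update("vitals", [98.0, 120.0, 80.0])      # vitals arrive later
second = cache.fused()

print(cache.encode_calls)  # one encoder call per arrival, text is not re-encoded
```

A design like this trades a small amount of memory for avoiding repeated encoder passes, which matters most when the encoders are the expensive part of the pipeline.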

Paper Structure

This paper contains 39 sections, 17 figures, 7 tables.

Figures (17)

  • Figure 1: While moving, the EMT perceives multimodal EMS data at different times and selects the protocol "Medical-Respiratory".
  • Figure 2: Overview of EMSNet for EMS
  • Figure 3: Pipeline of the multimodal data processor to prepare four datasets for training our EMSFoundation model: D1 (2-modal: text, vitals), D2 (3-modal: text, vitals, scene), D3 (audio), and D4 (image).
  • Figure 4: (a)-(b): Whisper-s and Whisper-m achieve lower error rates and improved generalization across user accents and microphones. (c)-(e): Spectrograms show high-frequency signal loss in Google Glass due to an 8 kHz cutoff.
  • Figure 5: With different text prompts (left), we evaluate (right) Grounding DINO on the collected EMS scene image dataset containing pills, alcohol, and medicine bottles.
  • ...and 12 more figures