Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning

Leon Nissen, Philipp Zagar, Vishnu Ravi, Aydin Zahedivash, Lara Marie Reimer, Stephan Jonas, Oliver Aalami, Paul Schmiedmayer

TL;DR

This work evaluates the feasibility of running medical-domain LLMs entirely on mobile devices using the AMEGA benchmark within the HealthBench iOS framework. By converting models to MLX format and applying 4-bit quantization, the study measures accuracy, throughput, and thermal behavior across a diverse set of devices, revealing memory capacity as the primary bottleneck. Med42 and Aloe emerge as the most accurate medical models, while Phi-3 Mini delivers a favorable balance of speed and size, illustrating that model architecture and data matter as much as parameter count. The findings underscore the viability of on-device, privacy-preserving clinical reasoning and highlight the need for efficient inference and domain-tailored models to enable deployment on a broad range of devices.
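To make the memory bottleneck concrete, the following is a rough back-of-the-envelope sketch, not the study's methodology. It estimates the storage needed for 4-bit quantized weights and checks it against a device memory budget. The function names, the `usable_fraction` value, and the example RAM size are illustrative assumptions; the estimate ignores quantization scales, the KV cache, and activations, so real footprints are larger.

```python
def quantized_weight_bytes(num_params: float, bits_per_weight: float = 4.0) -> float:
    """Lower-bound size of the model weights in bytes.

    Ignores quantization scales/zero-points, KV cache, and activations.
    """
    return num_params * bits_per_weight / 8.0


def fits_in_memory(
    num_params: float,
    device_ram_gb: float,
    usable_fraction: float = 0.5,  # assumption: share of RAM an app may use
    bits_per_weight: float = 4.0,
) -> bool:
    """Check whether the quantized weights fit into the assumed memory budget."""
    budget_bytes = device_ram_gb * 1e9 * usable_fraction
    return quantized_weight_bytes(num_params, bits_per_weight) <= budget_bytes


if __name__ == "__main__":
    # Illustrative sizes: about 3.8B parameters (Phi-3 Mini's published size)
    # and a generic 8B-parameter model, on a hypothetical 6 GB device.
    for name, params in [("3.8B model", 3.8e9), ("8B model", 8e9)]:
        gigabytes = quantized_weight_bytes(params) / 1e9
        print(f"{name}: ~{gigabytes:.1f} GB of weights, "
              f"fits on 6 GB device: {fits_in_memory(params, 6)}")
```

Under these assumptions, halving the bits per weight halves the weight footprint, which is why quantization decides whether a given model loads at all on a memory-limited phone, independent of how fast its processor is.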

Abstract

The deployment of Large Language Models (LLMs) on mobile devices offers significant potential for medical applications, enhancing privacy, security, and cost-efficiency by eliminating reliance on cloud-based services and keeping sensitive health data local. However, the performance and accuracy of on-device LLMs in real-world medical contexts remain underexplored. In this study, we benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating accuracy, computational efficiency, and thermal limitations across various mobile devices. Our results indicate that compact general-purpose models like Phi-3 Mini achieve a strong balance between speed and accuracy, while medically fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably, deploying LLMs on older devices remains feasible, with memory constraints posing a greater challenge than raw processing power. Our study underscores the potential of on-device LLMs for healthcare while emphasizing the need for more efficient inference and models tailored to real-world clinical reasoning.

Paper Structure

This paper contains 21 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Performance Evaluation Across Devices and Models.
  • Figure 2: Output tokens per second compared to the thermal state of all iPhones over all models.
  • Figure 3: Performance comparison of LLMs across different devices. Plot (a) shows the AMEGA score relative to the model's parameter size (in billions), while plot (b) visualizes the trade-off between AMEGA score and output tokens per second. Different colors represent different models, and marker shapes indicate devices. The dashed lines highlight the mean values.