MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases

Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese

TL;DR

MobileAIBench addresses the challenge of deploying LLMs and LMMs on mobile devices by introducing a two-part benchmarking framework that covers quantized models up to 7B parameters across NLP, multimodal, and trust/safety tasks. The desktop evaluation library enables broad, task-oriented benchmarking, while the on-device iOS app measures real-world latency and hardware utilization on an iPhone 14, providing end-to-end insights into feasibility. The framework uses a 20-dataset suite with quantization levels from 16-bit down to 3-bit, and supports Hugging Face and llama.cpp to facilitate open-model evaluation and deployment. Key findings show that larger models generally outperform smaller ones, that quantization changes performance only modestly on many tasks, and that on-device resource demands remain a major constraint, highlighting the need for more compact mobile-ready models and broader tooling for mobile AI deployment.

Abstract

The deployment of Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices has gained significant attention due to the benefits of enhanced privacy, stability, and personalization. However, the hardware constraints of mobile devices necessitate the use of models with fewer parameters and model compression techniques like quantization. Currently, there is limited understanding of how quantization affects performance across tasks, including LLM tasks, LMM tasks, and, critically, trust and safety. There is also a lack of adequate tools for systematically testing these models on mobile devices. To address these gaps, we introduce MobileAIBench, a comprehensive benchmarking framework for evaluating mobile-optimized LLMs and LMMs. MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices. Our two-part open-source framework includes a library for running evaluations on desktops and an iOS app for on-device latency and hardware utilization measurements. Our thorough analysis aims to accelerate mobile AI research and deployment by providing insights into the performance and feasibility of deploying LLMs and LMMs on mobile platforms.
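
To make the hardware constraint concrete, the following is a minimal back-of-the-envelope sketch of how weight storage scales with quantization level for a 7B-parameter model. It is not part of MobileAIBench: the function name `approx_weight_size_gb` is hypothetical, and the estimate assumes every weight is stored at exactly the stated bit width, ignoring file metadata, per-block scale factors, and the mixed-precision layouts that real quantization formats use, so actual files will typically be somewhat larger.

```python
def approx_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough weight-only footprint in gigabytes (1 GB = 10**9 bytes).

    Assumption: all weights use the same bit width and there is no
    format overhead. This is an illustration, not a measured value.
    """
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


if __name__ == "__main__":
    # Bit widths spanning the 16-bit to 3-bit range mentioned in the summary.
    for bits in (16, 8, 4, 3):
        size = approx_weight_size_gb(7e9, bits)
        print(f"{bits:>2}-bit weights, 7B parameters: ~{size:.1f} GB")
```

Under these assumptions, halving the bit width halves the storage needed for the weights, which is why lower-bit quantization is attractive on memory-limited devices even before latency is considered.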

Paper Structure

This paper contains 26 sections, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: MobileAIBench Architecture
  • Figure 2: MobileAIBench iOS app
  • Figure 3: Performance change of LMMs under different quantization.
  • Figure 4: Trade-off between accuracy and disk usage under 4-bit quantization.
  • Figure 5: Distribution of performance changes: (a) per LLM, (b) per task, when transitioning from 16-bit to 8-bit quantization.
  • ...and 1 more figure