PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms

Yilong Li, Jingyu Liu, Hao Zhang, M Badri Narayanan, Utkarsh Sharma, Shuai Zhang, Pan Hu, Yijing Zeng, Jayaram Raghuram, Suman Banerjee

TL;DR

PalmBench addresses the need for robust on-device benchmarking of compressed LLMs on mobile hardware. It introduces an automated framework and cross-platform methodology to evaluate resource use, throughput, accuracy, and harmful outputs across diverse devices and quantization schemes, using frameworks such as MLC-LLM and llama.cpp, and datasets including SQuAD, Natural Questions, MT-Bench, HaluEval, and TruthfulQA. The study shows how quantization (notably 4-bit and related group-wise methods) affects memory, GPU workload, power, temperature, and output quality, and reveals platform-specific performance differences (notably iOS efficiency). The work offers practical guidance for deploying mobile LLMs and highlights safety concerns in compressed models (hallucinations and toxicity), informing model selection, quantization strategy, and hardware choices for real-world, privacy-preserving on-device AI.

Abstract

Deploying large language models (LLMs) locally on mobile devices is advantageous in scenarios where transmitting data to remote cloud servers is either undesirable due to privacy concerns or impractical due to limited network connectivity. Recent advancements (MLC, 2023a; Gerganov, 2023) have facilitated the local deployment of LLMs. However, local deployment also presents challenges, particularly in balancing quality (generative performance), latency, and throughput within the hardware constraints of mobile devices. In this paper, we introduce our lightweight, all-in-one automated benchmarking framework that allows users to evaluate LLMs on mobile devices. We provide a comprehensive benchmark of various popular LLMs with different quantization configurations (both weights and activations) across multiple mobile platforms with varying hardware capabilities. Unlike traditional benchmarks that assess full-scale models on high-end GPU clusters, we focus on evaluating resource efficiency (memory and power consumption) and harmful output for compressed models on mobile devices. Our key observations include i) differences in energy efficiency and throughput across mobile platforms; ii) the impact of quantization on memory usage, GPU execution time, and power consumption; iii) accuracy and performance degradation of quantized models compared to their non-quantized counterparts; and iv) the frequency of hallucinations and toxic content generated by compressed LLMs on mobile devices.
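
To make the link between quantization and memory usage concrete, the following is a minimal back-of-the-envelope sketch, not the paper's measurement method. It assumes a simple group-wise layout in which each group of weights shares one 16-bit scale. The function name `quantized_weight_gib`, the default group size of 32, and the 3-billion-parameter count are illustrative assumptions. The estimate covers weight storage only and ignores activations, the KV cache, and runtime overhead, so it is not comparable to measured on-device memory.

```python
def quantized_weight_gib(n_params, weight_bits, group_size=32, scale_bits=16):
    """Approximate weight storage in GiB under group-wise quantization.

    Assumption: every `group_size` weights share one scale of `scale_bits`
    bits, so the effective cost is weight_bits + scale_bits / group_size
    bits per weight. Real formats may add zero-points or padding.
    """
    bits_per_weight = weight_bits + scale_bits / group_size
    return n_params * bits_per_weight / 8 / 2**30


# Illustrative parameter count (a round number, not taken from the paper).
n_params = 3e9

# Unquantized 16-bit baseline: no per-group scales are needed.
baseline = quantized_weight_gib(n_params, 16, scale_bits=0)
print(f"16-bit baseline: {baseline:.2f} GiB")

for bits in (8, 4, 3):
    size = quantized_weight_gib(n_params, bits)
    print(f"{bits}-bit, group size 32: {size:.2f} GiB "
          f"({size / baseline:.0%} of baseline)")
```

Under these assumptions, the per-group scale adds a fixed overhead of `scale_bits / group_size` bits per weight, which is why smaller groups trade extra memory for finer-grained scaling.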

Paper Structure

This paper contains 36 sections, 13 figures, and 12 tables.

Figures (13)

  • Figure 1: Overview and workflow of PalmBench -- our evaluation and benchmarking framework for Large Language Models (LLMs) on mobile devices.
  • Figure 2: Average memory usage (GB) while running MLC and llama.cpp.
  • Figure 3: CPU and GPU usage during inference of RedPajama-INCITE-3B across different quantizations.
  • Figure 4: GPU Utilization (%) timeline for 3-bit and 4-bit quantized RedPajama models on Google Pixel 7.
  • Figure 5: GPU memory read/write speed while running LLaMa-3-8B-Instruct in 3-bit and 4-bit quantization on Pixel 7.
  • ...and 8 more figures