A Systematic Evaluation of On-Device LLMs: Quantization, Performance, and Resources

Qingyu Song, Rui Liu, Wei Lin, Peiyu Liao, Wenqian Zhao, Yiwen Wang, Shoubo Hu, Yining Jiang, Mochun Long, Hui-Ling Zhen, Ning Jiang, Mingxuan Yuan, Qiao Xiang, Hong Xu

TL;DR

This work introduces a systematic methodology to evaluate on-device LLMs, balancing capability, efficiency, and resource constraints, and offers guidelines for optimizing LLMs in resource-constrained edge environments.

Abstract

Deploying Large Language Models (LLMs) on edge devices enhances privacy but faces performance hurdles due to limited resources. We introduce a systematic methodology to evaluate on-device LLMs, balancing capability, efficiency, and resource constraints. Through an extensive analysis of models (0.5B-14B) and seven post-training quantization (PTQ) methods on commodity hardware, we demonstrate that: 1) Heavily quantized large models consistently outperform smaller, high-precision models, with a performance threshold at ~3.5 effective bits-per-weight (BPW); 2) Resource utilization scales linearly with BPW, though power and memory footprints vary by quantization algorithm; and 3) With a reduction in model size, the primary constraint on throughput transitions from communication overhead to computational latency. We conclude by offering guidelines for optimizing LLMs in resource-constrained edge environments. Our codebase is available at https://anonymous.4open.science/r/LLMOnDevice/.
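As a rough illustration of the effective bits-per-weight (BPW) notion used above, the following sketch computes BPW for block-quantized formats and the resulting weight memory. It is not taken from the paper's codebase. The function names are hypothetical, and the block layouts (32 weights per block plus one 16-bit scale, as in llama.cpp-style q4_0, q5_0, and q8_0) are assumptions about the formats the paper labels qn_0. Only weight storage is counted; KV cache and activations are ignored.

```python
def effective_bpw(block_bytes: int, weights_per_block: int) -> float:
    """Bits stored per weight, including per-block metadata such as scales."""
    return block_bytes * 8 / weights_per_block


def weight_memory_gib(n_params: float, bpw: float) -> float:
    """Approximate weight storage in GiB; linear in BPW for a fixed parameter count."""
    return n_params * bpw / 8 / 2**30


# Assumed block layouts: 32 weights per block plus a 2-byte fp16 scale.
ASSUMED_BLOCKS = {
    "q4_0": (18, 32),  # 16 bytes of 4-bit values + 2-byte scale
    "q5_0": (22, 32),  # 16 bytes of low nibbles + 4 bytes of high bits + 2-byte scale
    "q8_0": (34, 32),  # 32 bytes of 8-bit values + 2-byte scale
}

if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model
    for name, (block_bytes, n_weights) in ASSUMED_BLOCKS.items():
        bpw = effective_bpw(block_bytes, n_weights)
        print(f"{name}: {bpw:.2f} BPW, ~{weight_memory_gib(n_params, bpw):.2f} GiB")
```

Under these assumptions the per-block scale is why a nominally 4-bit format lands above 4 effective bits, which is worth keeping in mind when comparing a format against the ~3.5 BPW threshold reported in the abstract.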

Paper Structure

This paper contains 18 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Throughput across Quantization Methods (Solid: qn_k, Dashed: qn_0)
  • Figure 2: Throughput across Token Lengths (Solid: q5_k, Dashed: q5_0)
  • Figure 3: Throughput across CPU Resources (Solid: Prefilling, Dashed: Decoding)
  • Figure 4: qn_0/qn_k Throughput Ratio across Token Lengths (Solid: Prefilling, Dashed: Decoding). Left: Laptop with VNNI; Middle: Testbed with VNNI; Right: Testbed w/o VNNI.
  • Figure 5: System Resource Utilization for Qwen 2.5 Models with 128-token Inputs and 1000-token Outputs.
  • ...and 3 more figures