A Systematic Evaluation of On-Device LLMs: Quantization, Performance, and Resources

Qingyu Song; Rui Liu; Wei Lin; Peiyu Liao; Wenqian Zhao; Yiwen Wang; Shoubo Hu; Yining Jiang; Mochun Long; Hui-Ling Zhen; Ning Jiang; Mingxuan Yuan; Qiao Xiang; Hong Xu

TL;DR

This work introduces a systematic methodology to evaluate on-device LLMs, balancing capability, efficiency, and resource constraints, and offers guidelines for optimizing LLMs in resource-constrained edge environments.

Abstract

Deploying Large Language Models (LLMs) on edge devices enhances privacy, but limited resources constrain performance. We introduce a systematic methodology to evaluate on-device LLMs, balancing capability, efficiency, and resource constraints. Through an extensive analysis of models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity hardware, we demonstrate that: 1) heavily quantized large models consistently outperform smaller, high-precision models, with a performance threshold at ~3.5 effective bits-per-weight (BPW); 2) resource utilization scales linearly with BPW, though power and memory footprints vary by quantization algorithm; and 3) as model size decreases, the primary constraint on throughput shifts from communication overhead to computational latency. We conclude by offering guidelines for optimizing LLMs in resource-constrained edge environments. Our codebase is available at https://anonymous.4open.science/r/LLMOnDevice/.
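To make the BPW trade-off concrete, the following is a minimal sketch of how effective bits-per-weight relates to weight storage. It assumes effective BPW means total stored weight bits (including quantization metadata such as scales) divided by parameter count; the abstract does not define the term, so this is a common convention and not necessarily the authors' exact definition. The function names and the 14B/3B comparison are hypothetical illustrations, not results from the paper.

```python
def effective_bpw(weight_bytes: int, n_params: int) -> float:
    """Average bits stored per parameter (assumed definition: total bits / params)."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return weight_bytes * 8 / n_params


def weight_bytes_at(n_params: int, bpw: float) -> float:
    """Approximate weight storage in bytes at a given effective BPW."""
    return n_params * bpw / 8


GIB = 1024 ** 3

# Hypothetical comparison: a 14B model at 3.5 BPW vs. a 3B model at 16 BPW.
# Both land at a similar weight footprint, which is the regime the abstract's
# first finding is about.
large_quantized = weight_bytes_at(14_000_000_000, 3.5)
small_full = weight_bytes_at(3_000_000_000, 16.0)
print(f"14B @ 3.5 BPW: {large_quantized / GIB:.2f} GiB")
print(f" 3B @ 16 BPW: {small_full / GIB:.2f} GiB")
```

This covers weight storage only; runtime memory also includes activations and the KV cache, which this sketch does not model.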
