LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment
Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong, Yongtao Tang
TL;DR
This work addresses the challenge of deploying large language models on resource-constrained edge devices by introducing Layer-Specific Adaptive Quantization (LSAQ). LSAQ assigns per-layer quantization precision based on a novel layer-importance metric derived from top-k token sets and Jaccard similarity, and adapts deployment strategies according to available GPU resources via an offline/online framework. The method comprises modules for layer-importance detection, resource detection, quantization-strategy formulation, and per-channel model quantization, culminating in feasible edge deployments with reduced memory and preserved accuracy. Empirical results on Llama-2-7B/13B and Llama-3-8B show that LSAQ improves zero-shot task performance and perplexity relative to a cosine-similarity-based baseline while enabling deployment on mainstream GPUs with significantly lower memory footprints.
Abstract
As Large Language Models (LLMs) demonstrate exceptional performance across various domains, deploying them on edge devices has emerged as a new trend. Quantization techniques, which reduce the size and memory requirements of LLMs, are effective for deployment on resource-limited edge devices. However, existing one-size-fits-all quantization methods cannot dynamically adjust the memory requirements of LLMs, which limits their use on practical edge devices with varying computational resources. To tackle this issue, we propose Layer-Specific Adaptive Quantization (LSAQ), a system for adaptive quantization and dynamic deployment of LLMs based on layer importance. Specifically, LSAQ evaluates the importance of each layer by constructing top-k token sets from the layer's inputs and outputs and calculating their Jaccard similarity. Based on layer importance, our system adaptively adjusts the quantization strategy in real time according to the computational resources of the edge device, applying higher quantization precision to more important layers and lower precision to less important ones. Experimental results show that LSAQ consistently outperforms the selected quantization baselines in terms of perplexity and zero-shot task performance. Additionally, it can devise appropriate quantization schemes for different usage scenarios to facilitate the deployment of LLMs.
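The layer-importance idea can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`top_k_tokens`, `layer_importance`, `assign_bits`) are hypothetical, each layer's input and output are assumed to have already been projected to vocabulary logits, importance is assumed to be one minus the Jaccard similarity (a layer that changes the top-k token set more is treated as more important), and the 8-bit/4-bit split and the `n_high` parameter are placeholders for whatever the resource-detection step would choose.

```python
from typing import List, Sequence, Set


def top_k_tokens(logits: Sequence[float], k: int) -> Set[int]:
    """Return the ids of the k highest-scoring tokens as a set."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def layer_importance(logits_in: Sequence[float],
                     logits_out: Sequence[float],
                     k: int) -> float:
    """Assumed importance score: 1 - Jaccard similarity of the top-k sets.

    logits_in / logits_out are vocabulary scores obtained from the layer's
    input and output hidden states (the projection itself is not shown).
    """
    return 1.0 - jaccard(top_k_tokens(logits_in, k), top_k_tokens(logits_out, k))


def assign_bits(importances: Sequence[float],
                n_high: int,
                high_bits: int = 8,
                low_bits: int = 4) -> List[int]:
    """Give the n_high most important layers high_bits, the rest low_bits.

    n_high is a hypothetical knob standing in for the memory budget.
    """
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    high = set(ranked[:n_high])
    return [high_bits if i in high else low_bits for i in range(len(importances))]


if __name__ == "__main__":
    # Toy 4-token vocabulary: the layer swaps one of the top-2 tokens.
    score = layer_importance([0.1, 0.9, 0.8, 0.2], [0.9, 0.1, 0.8, 0.2], k=2)
    print(round(score, 3))
    print(assign_bits([0.2, 0.9, 0.5], n_high=1))
```

In this sketch a tighter memory budget would correspond to a smaller `n_high`, so fewer layers keep the higher precision.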
