LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment

Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong, Yongtao Tang

TL;DR

This work addresses the challenge of deploying large language models on resource-constrained edge devices by introducing Layer-Specific Adaptive Quantization (LSAQ). LSAQ assigns per-layer quantization precision based on a novel layer-importance metric derived from top-k token sets and Jaccard similarity, and adapts deployment strategies according to available GPU resources via an offline/online framework. The method comprises modules for layer-importance detection, resource detection, quantization-strategy formulation, and per-channel model quantization, culminating in feasible edge deployments with reduced memory and preserved accuracy. Empirical results on Llama-2-7B/13B and Llama-3-8B show that LSAQ improves zero-shot task performance and perplexity relative to a cosine-similarity-based baseline while enabling deployment on mainstream GPUs with significantly lower memory footprints.

Abstract

As Large Language Models (LLMs) demonstrate exceptional performance across various domains, deploying LLMs on edge devices has emerged as a new trend. Quantization techniques, which reduce the size and memory requirements of LLMs, are effective for deploying LLMs on resource-limited edge devices. However, existing one-size-fits-all quantization methods often fail to dynamically adjust the memory requirements of LLMs, which limits their applicability to practical edge devices with varying computational resources. To tackle this issue, we propose Layer-Specific Adaptive Quantization (LSAQ), a system for adaptive quantization and dynamic deployment of LLMs based on layer importance. Specifically, LSAQ evaluates the importance of an LLM's layers by constructing top-k token sets from the inputs and outputs of each layer and calculating their Jaccard similarity. Based on layer importance, our system adaptively adjusts quantization strategies in real time according to the computational resources of edge devices, applying higher quantization precision to more important layers and lower precision to less important ones. Experimental results show that LSAQ consistently outperforms the selected quantization baselines in terms of perplexity and zero-shot tasks. Additionally, it can devise appropriate quantization schemes for different usage scenarios to facilitate the deployment of LLMs.
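The layer-importance idea in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions rather than statements from the text above: the hidden state is projected onto the vocabulary with a hypothetical unembedding matrix `unembed`, and importance is taken as one minus the Jaccard similarity, so that a layer whose output top-k tokens differ more from its input top-k tokens scores higher. The function names are illustrative only.

```python
from typing import Sequence, Set


def top_k_tokens(hidden: Sequence[float],
                 unembed: Sequence[Sequence[float]],
                 k: int) -> Set[int]:
    """Return the ids of the k highest-scoring vocabulary tokens.

    `unembed` is a hypothetical (vocab_size x hidden_dim) projection matrix;
    how LSAQ maps hidden states to tokens is an assumption in this sketch.
    """
    # One logit per vocabulary entry: dot product of its row with the hidden state.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    ranked = sorted(range(len(logits)), key=lambda v: logits[v], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def layer_importance(layer_in: Sequence[float],
                     layer_out: Sequence[float],
                     unembed: Sequence[Sequence[float]],
                     k: int) -> float:
    """Assumed score: 1 - Jaccard similarity of input and output top-k sets."""
    return 1.0 - jaccard(top_k_tokens(layer_in, unembed, k),
                         top_k_tokens(layer_out, unembed, k))
```

With this convention, a layer that leaves the top-k token set unchanged scores 0, and a layer that replaces it entirely scores 1. In practice the scores would presumably be averaged over many tokens of a calibration set, which this sketch omits.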

Paper Structure

This paper contains 21 sections, 4 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: The framework of LSAQ. It is composed of offline and online parts. In the offline part, the importance of each layer of the LLM is first obtained, and the available GPU resources at the current moment are detected simultaneously. Based on this, a quantization strategy is meticulously formulated. Subsequently, this quantization strategy is transmitted to the online part, where the model is quantized according to this strategy.
  • Figure 2: The process of constructing top-$k$ token sets.
  • Figure 3: Importance of LLM layers.
  • Figure 4: Memory usage of the quantized model.
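The strategy-formulation step described in the Figure 1 caption (layer importance plus detected GPU resources yielding a per-layer quantization plan) might look roughly like the Python sketch below. This is not the paper's algorithm: the greedy demotion rule, the 8-bit and 4-bit choices, the uniform `params_per_layer`, and the function name `formulate_strategy` are all assumptions made for illustration.

```python
from typing import List


def formulate_strategy(importance: List[float],
                       params_per_layer: int,
                       budget_bytes: float,
                       high_bits: int = 8,
                       low_bits: int = 4) -> List[int]:
    """Assign a bit-width to each layer so the weights fit in budget_bytes.

    Hypothetical greedy rule: start with every layer at high_bits, then
    demote layers to low_bits in order of increasing importance until the
    estimated weight memory fits the budget.
    """
    bits = [high_bits] * len(importance)

    def total_bytes(widths: List[int]) -> float:
        # Weight memory only; activations and KV cache are ignored here.
        return sum(params_per_layer * w / 8 for w in widths)

    # Visit layers from least to most important.
    for idx in sorted(range(len(importance)), key=lambda i: importance[i]):
        if total_bytes(bits) <= budget_bytes:
            break
        bits[idx] = low_bits

    if total_bytes(bits) > budget_bytes:
        raise ValueError("model does not fit even with all layers at low_bits")
    return bits


# Example: four layers of 1000 parameters each, 3000-byte weight budget.
plan = formulate_strategy([0.9, 0.1, 0.5, 0.3], 1000, 3000)
print(plan)
```

In this example the two least important layers (indices 1 and 3) are demoted to 4 bits, which brings the estimated weight memory from 4000 bytes down to the 3000-byte budget while the two most important layers stay at 8 bits.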