DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park

TL;DR

DP-LLM addresses the problem of running on-device LLMs under diverse latency and accuracy constraints by introducing a dynamic layer-wise precision mechanism. It defines a set of candidate bitwidths for each layer and uses a lightweight precision selector, guided by a relative-error proxy, to adjust precision at each decoding step. This enables fine-grained precision targets, including non-integer ones, within a memory budget. The approach combines offline optimization (phase-wise precision assignment) with runtime estimation (hybrid linear/regression and JL-based methods, plus asynchronous updates) to achieve a favorable performance-latency trade-off, validated across multiple models and benchmarks with minimal overhead. Overall, DP-LLM demonstrates that exploiting the dynamic sensitivity of layers during decoding yields significant gains over static mixed-precision baselines for on-device LLM inference.

Abstract

How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by overlaying multiple model variants quantized to different bitwidths, which enables memory-efficient runtime model adaptation of LLMs. An important question nevertheless remains open: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer changes dynamically across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
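To make the idea of input-dependent, layer-wise precision selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single linear layer with two candidate bitwidths, a toy uniform quantizer, and a per-layer threshold on relative output error; the names `quantize`, `select_bits`, and `threshold` are hypothetical. For clarity it computes the relative error exactly, whereas the paper relies on lightweight estimators of that error so that the selection itself stays cheap.

```python
import math

def quantize(weights, bits):
    """Toy uniform symmetric quantizer (assumption: illustrative only)."""
    max_abs = max(abs(w) for row in weights for w in row) or 1.0
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    return [[round(w / scale) * scale for w in row] for row in weights]

def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]

def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(a * a for a in v))

def select_bits(weights, x, low_bits, high_bits, threshold):
    """Choose a bitwidth for this layer given the current input x.

    Uses low_bits when the relative error of the low-precision output,
    measured against the high-precision output, is within `threshold`;
    otherwise falls back to high_bits.
    """
    y_low = matvec(quantize(weights, low_bits), x)
    y_high = matvec(quantize(weights, high_bits), x)
    denom = norm(y_high)
    if denom == 0.0:
        return low_bits  # nothing to lose when the output is zero
    rel_err = norm([a - b for a, b in zip(y_low, y_high)]) / denom
    return low_bits if rel_err <= threshold else high_bits

# Hypothetical layer and input for one decoding step.
W = [[0.9, -0.31], [0.12, 0.57]]
x = [1.0, 2.0]
print(select_bits(W, x, low_bits=3, high_bits=8, threshold=0.5))
print(select_bits(W, x, low_bits=3, high_bits=8, threshold=0.01))
```

Because the decision depends on `x`, the same layer can run at different bitwidths on different decoding steps, so the average precision over a sequence need not be an integer.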

Paper Structure

This paper contains 49 sections, 8 equations, 11 figures, 13 tables, 1 algorithm.

Figures (11)

  • Figure 1: Runtime model adaptation
  • Figure 2: Different precision assignment schemes
  • Figure 3: (a) Sensitivity of different layers at each decoding step. (b) Perplexity trend of different precision assignment schemes.
  • Figure 4: Overview of DP-LLM
  • Figure 5: Overview of layer-wise candidate precision set and threshold assignment
  • ...and 6 more figures