FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices

Yuji Chai, Mujin Kwen, David Brooks, Gu-Yeon Wei

TL;DR

FlexQuant tackles memory elasticity for locally hosted LLMs on edge devices by building an ensemble of Elastic Quantization Models (EQMs) that enables fine-grained memory-footprint adjustments. It combines a one-way, Monte Carlo Tree Search–like navigation with a sensitivity-guided pruning strategy to select high-quality sequences of module replacements between $QM(n_{up})$ and $QM(n_{low})$, without extra storage beyond the base quantized models (QMs). The approach achieves a $15\times$ granularity improvement and a $10\times$ storage reduction over state-of-the-art elastic hosting, while maintaining or surpassing baseline accuracy on downstream tasks; pruning further reduces storage by up to $40\%$. Evaluations on Llama-1/2/3 variants show smooth memory-accuracy trade-offs and competitive task performance, illustrating practical viability for edge deployments that need privacy and offline capability.
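To make the module-replacement idea concrete, the sketch below shows how swapping modules one at a time from an $n_{up}$-bit model to an $n_{low}$-bit model produces a ladder of intermediate footprints. It is a toy illustration under stated assumptions, not the paper's algorithm: the `Module` class, the `sensitivity` score, and the greedy least-sensitive-first ordering are all hypothetical stand-ins for FlexQuant's tree search, and the parameter counts are made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Module:
    name: str
    n_params: int       # number of weights in this module
    sensitivity: float  # hypothetical score: higher = more quality loss if downgraded


def footprint_bytes(modules: List[Module], low: Set[str], n_up: int, n_low: int) -> int:
    """Weight memory when modules named in `low` use n_low bits and the rest use n_up."""
    bits = sum(m.n_params * (n_low if m.name in low else n_up) for m in modules)
    return bits // 8


def replacement_sequence(
    modules: List[Module], n_up: int, n_low: int
) -> List[Tuple[Optional[str], int]]:
    """Greedy stand-in for the search: downgrade the least sensitive module first.

    Returns (module replaced at this step, resulting footprint in bytes) pairs,
    starting from the all-n_up model, where nothing has been replaced yet.
    """
    low: Set[str] = set()
    steps: List[Tuple[Optional[str], int]] = [
        (None, footprint_bytes(modules, low, n_up, n_low))
    ]
    for m in sorted(modules, key=lambda mod: mod.sensitivity):
        low.add(m.name)
        steps.append((m.name, footprint_bytes(modules, low, n_up, n_low)))
    return steps


def pick_step(steps: List[Tuple[Optional[str], int]], budget_bytes: int) -> Optional[int]:
    """Index of the first (highest-precision) step that fits the budget, else None."""
    for i, (_, size) in enumerate(steps):
        if size <= budget_bytes:
            return i
    return None


if __name__ == "__main__":
    # Toy module list with invented sizes and sensitivities.
    toy = [Module("attn", 8000, 0.5), Module("mlp", 16000, 0.1), Module("head", 8000, 0.9)]
    for name, size in replacement_sequence(toy, n_up=4, n_low=3):
        print(name, size)
```

With $N$ modules, this yields $N + 1$ footprint levels between the two base models, where switching whole models offers only two. Both base models are already stored, so each intermediate level only needs a record of which modules have been swapped.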

Abstract

Deploying LLMs on edge devices presents serious technical challenges. Memory elasticity is crucial for edge devices with unified memory, where memory is shared and fluctuates dynamically. Existing solutions suffer from either poor transition granularity or high storage costs. We propose FlexQuant, a novel elasticity framework that generates an ensemble of quantized models, providing an elastic hosting solution with 15x granularity improvement and 10x storage reduction compared to SoTA methods. FlexQuant works with most quantization methods and creates a family of trade-off options under various storage limits through our pruning method. It brings great performance and flexibility to the edge deployment of LLMs.

Paper Structure (15 sections, 5 figures, 1 table, 1 algorithm)

Figures (5)

  • Figure 1: Perplexity comparison of different elastic hosting methods for the quantized Llama 2 7B model. It compares FlexQuant with the baseline method under two quantization methods, ExLlamaV2 and AnyPrecision.
  • Figure 2: An example of FlexQuant's tree search for an EQM ensemble. The search process shown has two exploitation stems and three exploration branches.
  • Figure 3: Perplexity comparison between Base-Ex, Base-AP, FQ-Ex, and PFQ-Ex at different pruning rates. Results for AnyPrecision on Llama 3 8B are omitted due to lack of support in its implementation. The footprint ranges differ across models because of differences in parameter count and model architecture.
  • Figure 4: Perplexity vs. pruning rate at varying memory-footprint bounds for FQ-Ex on Llama 1, Llama 2, and Llama 3.
  • Figure 5: Downstream perplexity comparison of quantized Llama models between Base-Ex, FQ-Ex, and PFQ-Ex at different pruning rates.