FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices
Yuji Chai, Mujin Kwen, David Brooks, Gu-Yeon Wei
TL;DR
FlexQuant tackles memory elasticity for locally hosted LLMs on edge devices by building an ensemble of Elastic Quantization Models (EQMs) that enables fine-grained memory-footprint adjustments. It combines a one-way, Monte Carlo Tree Search-like navigation with a sensitivity-guided pruning strategy to select high-quality sequences of module replacements between $QM(n_{up})$ and $QM(n_{low})$, requiring no storage beyond the base QMs. The approach achieves a $15\times$ granularity improvement and a $10\times$ storage reduction over state-of-the-art elastic hosting, while maintaining or surpassing baseline accuracy on downstream tasks; pruning further reduces storage by up to $40\%$. Evaluations on Llama-1/2/3 variants show smooth memory-accuracy trade-offs and competitive task performance, indicating practical viability for edge deployments that require privacy and offline operation.
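To make the idea of a module-replacement sequence concrete, the following is a minimal sketch, not the authors' implementation. It assumes each module can be swapped independently from its $n_{up}$-bit version to its $n_{low}$-bit version, and it orders the swaps by a hypothetical per-module `sensitivity` score instead of the tree-search navigation described above. The names `Module`, `eqm_footprints`, and `pick_eqm` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    params: int         # number of weights in the module
    sensitivity: float  # hypothetical score: accuracy cost of lowering precision


def module_bytes(params: int, bits: int) -> float:
    """Storage for `params` weights at `bits` bits each (ignores quantization metadata)."""
    return params * bits / 8


def eqm_footprints(modules, n_up, n_low):
    """Return (order, footprints).

    footprints[k] is the model size in bytes after the first k modules in
    `order` have been swapped from n_up-bit to n_low-bit weights.
    Assumption: least sensitive modules are replaced first.
    """
    order = sorted(modules, key=lambda m: m.sensitivity)
    total = sum(module_bytes(m.params, n_up) for m in modules)
    footprints = [total]  # k = 0 corresponds to QM(n_up)
    for m in order:
        total -= module_bytes(m.params, n_up) - module_bytes(m.params, n_low)
        footprints.append(total)  # the last entry corresponds to QM(n_low)
    return order, footprints


def pick_eqm(footprints, budget_bytes):
    """Fewest replacements whose footprint fits the budget, or None if none fits."""
    for k, size in enumerate(footprints):
        if size <= budget_bytes:
            return k
    return None


# Toy example with made-up module sizes and sensitivities.
modules = [
    Module("attn.0", 1000, 0.9),
    Module("mlp.0", 2000, 0.1),
    Module("mlp.1", 2000, 0.5),
]
order, footprints = eqm_footprints(modules, n_up=4, n_low=3)
```

Each entry of `footprints` is one intermediate model between the two base QMs, so the number of modules sets the granularity of the memory steps. Because every intermediate model reuses modules already stored for the two base QMs, only the replacement order needs to be kept in addition to them.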
Abstract
Deploying LLMs on edge devices presents serious technical challenges. Memory elasticity is crucial for edge devices with unified memory, where memory is shared and fluctuates dynamically. Existing solutions suffer from either poor transition granularity or high storage costs. We propose FlexQuant, a novel elasticity framework that generates an ensemble of quantized models, providing an elastic hosting solution with 15x granularity improvement and 10x storage reduction compared to SoTA methods. FlexQuant works with most quantization methods and creates a family of trade-off options under various storage limits through our pruning method. It brings great performance and flexibility to the edge deployment of LLMs.
