UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs

Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu

TL;DR

UniQL tackles the challenge of deploying large language models on edge devices with dynamic resource constraints by unifying post-training quantization and structured pruning into a single cloud-assisted, one-shot workflow. It introduces structured weight sorting, quantization-aware decompositions, and fused RoPE kernels to support Transformers, SSMs, and hybrids, enabling on-device adaptive pruning of up to 35% and substantial memory and latency gains with minimal accuracy loss. Across multiple models and tasks, UniQL demonstrates competitive or superior performance to PTQ and pruning baselines, while delivering flexible, architecture-agnostic deployment suitable for edge and mobile scenarios. The work also provides detailed ablations and extensive hardware profiling, highlighting practical benefits in energy efficiency and Pareto-optimal trade-offs for edge-LLM inference.

Abstract

Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability also fluctuates with the current device workload, which adds uncertainty to model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates of up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
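
To make the quantization-aware SVD idea more concrete, the following is a minimal sketch, not the UniQL implementation. It assumes what the Figure 4 caption describes: the diagonal matrix of singular values is folded into U before quantization, so that each quantization group stores a single scaling factor. The function names, the toy factors, the choice of one group per row, and the symmetric 4-bit round-to-nearest scheme are all assumptions made for illustration.

```python
# Illustrative sketch only: names, shapes, and the symmetric 4-bit scheme are
# assumptions, not the UniQL implementation.

def fold_sigma(U, sigma):
    """Scale column j of U by sigma[j], which yields U * diag(sigma)."""
    return [[u * s for u, s in zip(row, sigma)] for row in U]


def quantize_group(values, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]


U = [[0.6, -0.8], [0.8, 0.6]]  # toy orthonormal factor
sigma = [3.0, 0.5]             # toy singular values, sorted in descending order
US = fold_sigma(U, sigma)      # approximately [[1.8, -0.4], [2.4, 0.3]]

# Treat each row of U * diag(sigma) as one quantization group (an assumption).
quantized = [quantize_group(row) for row in US]
restored = [dequantize_group(codes, scale) for codes, scale in quantized]
```

Because the singular values are already absorbed into the stored factor, no separate diagonal matrix has to be kept or quantized. With round-to-nearest, the per-element error in this sketch is bounded by half of the group scale.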

Paper Structure

This paper contains 48 sections, 4 equations, 8 figures, 20 tables, 7 algorithms.

Figures (8)

  • Figure 1: (Proposed framework overview.) UniQL supports Transformers, SSMs, and hybrid models, enabling one-shot compression using a single server-class GPU. The on-device pruning of the quantized model is feasible and configurable based on the current device workload. We present actual latency on Nano 8G in relation to accuracy for different pruning rates across three distinct models on the right. Circle sizes correspond to model sizes. Latency is measured using 512 prefilling tokens and 512 generated tokens on Nano.
  • Figure 2: (The UniQL pipeline.) We devise pseudo-inverse-free, quantization-aware, and state-aware matrix decomposition methods for the grouped weights to obtain sorted weights (a). During fine-tuning, we sample global pruning rates and mask out the weight channels (b). The refined patches are fused into the weights, followed by model quantization for deployment (c). Based on the system utilization, we perform on-device adaptive pruning of the quantized model (d).
  • Figure 3: (Joint weight decomposition.) We visualize the group of sorted weights in MLP (a), MHSA (b), and Mamba (c) blocks. The group of weights for joint decomposition is shown in the same background color, e.g., $\mathbf{W}_q$ and $\mathbf{W}_k$ in the pink background, and other groups are distinguished by different colors. We devise different types of joint compression algorithms that are efficient and quantization-aware to support on-device pruning.
  • Figure 4: (The fused kernel and SVD decomposition.) In the left illustration, gathering and slicing rotary positional embeddings by the index vector for $Q$ and $K$ are fused in one kernel to reduce memory access. The embeddings for the pruned head dimension $D'_\mathrm{hd}$ are gathered from the index array $\mathbf{S}_{sym}$ in the fused kernel. On the right, we combine the diagonal matrix $\mathbf{\Sigma}$ with $\mathbf{U}$ as the group shares a quantization scaling factor to reduce the quantization errors.
  • Figure 5: (Pareto-front analysis on A6000.) We evaluate the trade-off between average accuracy (%) and time-to-last-token (sec.) for various LLMs under different quantization and pruning configurations. Circle, square, and star markers denote GPTQ (W4A16), FP16, and our proposed UniQL (W4A16), respectively. Marker size indicates memory footprint.
  • ...and 3 more figures
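
The left half of Figure 4 describes gathering rotary positional embeddings by an index vector for the pruned head dimension. The following is a minimal plain-Python sketch of that idea, not the fused GPU kernel itself. It assumes the half-split RoPE layout, in which element i is rotated together with element i + D_hd/2, and it assumes that pruning keeps or drops both halves of a rotary pair together. The function names and the `kept_pairs` index list are hypothetical.

```python
import math

# Illustrative sketch only: the layout, names, and index list are assumptions.

def rope_tables(position, head_dim, base=10000.0):
    """Return cos/sin values per rotary pair for one token position."""
    half = head_dim // 2
    angles = [position * base ** (-2.0 * i / head_dim) for i in range(half)]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def apply_rope(x, cos, sin):
    """Rotate pairs (x[i], x[i + half]) of a full-width head vector."""
    half = len(x) // 2
    first = [x[i] * cos[i] - x[i + half] * sin[i] for i in range(half)]
    second = [x[i + half] * cos[i] + x[i] * sin[i] for i in range(half)]
    return first + second


def gather_and_apply_rope(x_pruned, cos, sin, kept_pairs):
    """Gather cos/sin for the kept pairs and rotate in a single pass."""
    half = len(kept_pairs)
    out = [0.0] * (2 * half)
    for j, p in enumerate(kept_pairs):
        c, s = cos[p], sin[p]  # gather by index instead of slicing a copy
        a, b = x_pruned[j], x_pruned[j + half]
        out[j] = a * c - b * s
        out[j + half] = b * c + a * s
    return out


head_dim = 8
kept_pairs = [0, 2, 3]  # hypothetical surviving rotary pairs
q_full = [0.5, -1.0, 2.0, 0.25, 1.5, 0.75, -0.5, 1.0]
cos, sin = rope_tables(position=5, head_dim=head_dim)

half = head_dim // 2
keep = kept_pairs + [p + half for p in kept_pairs]
q_pruned = [q_full[i] for i in keep]  # stands in for the pruned projection output

fused = gather_and_apply_rope(q_pruned, cos, sin, kept_pairs)
reference = [apply_rope(q_full, cos, sin)[i] for i in keep]
```

In this sketch, `fused` matches `reference`: rotating the pruned vector with gathered embeddings gives the same values as rotating the full vector and then dropping the pruned dimensions. Doing the gather inside the rotation loop avoids materializing a sliced copy of the cos/sin tables, which is the memory-access saving the caption attributes to the fused kernel.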