LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits

Amir Reza Mirzaei, Yuqiao Wen, Yanshuai Cao, Lili Mou

TL;DR

LoRAQuant tackles the memory burden of deploying multiple LoRA adapters for LLM customization by introducing a mixed-precision, SVD-based post-training quantization method. It splits each LoRA update into two sub-LoRAs, allocating more bits to the most informative part and quantizing the rest to ultra-low precision, with gradient-based optimization to reduce quantization error. Across LLaMA 2-7B/13B and Mistral 7B on math reasoning, coding, and summarization, it achieves an average bitwidth below 2 with performance comparable to full-precision adapters and to stronger, higher-bitwidth quantization baselines, which indicates that it scales to multi-LoRA scenarios. The approach offers substantial memory savings for LLM customization with little loss in task performance, enabling broader and more efficient deployment of personalized adapters.

Abstract

Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that LoRAQuant uses significantly fewer bits than other quantization methods while achieving comparable or even higher performance.
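
The reparameterize-then-split idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the even split of singular values between the two factors, the min/max round-to-nearest quantizer for the high-precision part, the sign-based 1-bit quantizer for the rest, and the default of 3 high-precision bits are all assumptions made for illustration. The gradient-based optimization step mentioned in the TL;DR is omitted.

```python
import numpy as np

def svd_reparameterize(B, A):
    """Rewrite the LoRA update B @ A (B: d_out x r, A: r x d_in) as B2 @ A2,
    where the rank-1 components are ordered by singular value."""
    Qb, Rb = np.linalg.qr(B)             # B = Qb @ Rb
    Qa, Ra = np.linalg.qr(A.T)           # A = Ra.T @ Qa.T
    U, S, Vt = np.linalg.svd(Rb @ Ra.T)  # small r x r SVD
    sqrt_s = np.sqrt(S)
    # Assumption: split each singular value evenly between the two factors.
    B2 = (Qb @ U) * sqrt_s
    A2 = (sqrt_s[:, None] * Vt) @ Qa.T
    return B2, A2, S

def rtn_quantize(W, bits, axis):
    """Uniform round-to-nearest with one min/max range per slice along `axis`."""
    lo = W.min(axis=axis, keepdims=True)
    hi = W.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / (2 ** bits - 1), 1.0)
    return np.round((W - lo) / scale) * scale + lo

def sign_binarize(W, axis):
    """1-bit quantization: sign of each entry times the mean |W| of its slice."""
    alpha = np.abs(W).mean(axis=axis, keepdims=True)
    return np.where(W >= 0, alpha, -alpha)

def mixed_precision_quantize(B, A, h, high_bits=3):
    """Return the dequantized update: top-h components at `high_bits`, rest at 1 bit."""
    r = B.shape[1]
    assert 0 < h < r, "this sketch expects both sub-LoRAs to be non-empty"
    B2, A2, _ = svd_reparameterize(B, A)
    # High-precision sub-LoRA: first h columns of B2 and first h rows of A2.
    Bh = rtn_quantize(B2[:, :h], high_bits, axis=0)
    Ah = rtn_quantize(A2[:h, :], high_bits, axis=1)
    # Low-precision sub-LoRA: the remaining r - h components, binarized.
    Bl = sign_binarize(B2[:, h:], axis=0)
    Al = sign_binarize(A2[h:, :], axis=1)
    return Bh @ Ah + Bl @ Al

def average_bits(r, h, high_bits, low_bits=1):
    """Average bits per weight, ignoring the storage cost of scales and offsets."""
    return (h * high_bits + (r - h) * low_bits) / r

# Usage on random factors (hypothetical sizes).
rng = np.random.default_rng(0)
B = rng.normal(size=(64, 16))
A = rng.normal(size=(16, 48))
delta_q = mixed_precision_quantize(B, A, h=2, high_bits=3)
rel_err = np.linalg.norm(B @ A - delta_q) / np.linalg.norm(B @ A)
print(f"avg bits: {average_bits(16, 2, 3):.2f}, relative error: {rel_err:.3f}")
```

With rank 16, two components at 3 bits, and the remaining fourteen at 1 bit, the average is (2·3 + 14·1)/16 = 1.25 bits per weight before counting scales and offsets. This shows how a small high-precision sub-LoRA can keep the average bitwidth below 2.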

Paper Structure

This paper contains 18 sections, 10 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of our LoRAQuant method.
  • Figure 2: Comparison of sub-LoRA splitting strategies. Here, $h$ denotes the rank of the high-precision sub-LoRA and is fixed globally for all LoRAs in a model.
  • Figure 3: Study on optimization and quantization of LoRAQuant. LoRAQuant is the proposed method. Prune truncates the less important sub-LoRA components. No Opt omits the gradient-based optimization step. LoRAQuant w/ RTN replaces the specialized binarization with 1-bit RTN quantization.
  • Figure 4: Comparison of $h$ selection strategies. Ratio denotes our method explained in §3.1, where the ratio hyperparameter varies from 0.1 to 0.95 in increments of 0.05, while Static sets $h$ to a fixed value ranging from 1 to 12.
  • Figure 5: Study on the column-wise and row-wise quantization of LoraQuant. Each entry is denoted as B (_) A (_), where each underscore can be either col or row. Here, col indicates column-wise quantization and row indicates row-wise quantization of the corresponding component.
  • ...and 1 more figure