AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning

Changhai Zhou, Shiyang Zhang, Yuhua Zhou, Qian Qiao, Jun Gao, Cheng Jin, Kaizhou Qin, Weizhong Zhang

TL;DR

AutoQRA is a joint optimization framework that simultaneously optimizes the bit-width and LoRA rank of each layer during mixed-precision quantized fine-tuning. It achieves performance close to full-precision fine-tuning with a memory footprint comparable to uniform 4-bit methods.

Abstract

Quantization followed by parameter-efficient fine-tuning has emerged as a promising paradigm for downstream adaptation under tight GPU memory constraints. However, this sequential pipeline fails to leverage the intricate interaction between quantization bit-width and LoRA rank. Specifically, a carefully optimized quantization allocation with low quantization error does not always translate to strong fine-tuning performance, and different bit-width and rank configurations can lead to significantly varying outcomes under the same memory budget. To address this limitation, we propose AutoQRA, a joint optimization framework that simultaneously optimizes the bit-width and LoRA rank configuration for each layer during the mixed quantized fine-tuning process. To tackle the challenges posed by the large discrete search space and the high evaluation cost associated with frequent fine-tuning iterations, AutoQRA decomposes the optimization process into two stages. First, it conducts a global multi-fidelity evolutionary search, where the initial population is warm-started by injecting layer-wise importance priors. This stage employs specific operators and a performance model to efficiently screen candidate configurations. Second, trust-region Bayesian optimization is applied to locally refine promising regions of the search space and identify optimal configurations under the given memory budget. This approach enables active compensation for quantization noise in specific layers during training. Experiments show that AutoQRA achieves performance close to full-precision fine-tuning with a memory footprint comparable to uniform 4-bit methods.
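
To make the "same memory budget" constraint concrete, the following is a minimal sketch of how a per-layer (bit-width, rank) configuration could be costed and checked against a budget. It is not the paper's memory model: the names `LayerConfig`, `layer_bytes`, `total_bytes`, and `is_feasible` are hypothetical, adapters are assumed to be stored in 16-bit, and quantization metadata (scales, zero points), optimizer states, and activations are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    """Hypothetical per-layer choice: base-weight bit-width and LoRA rank."""
    d_in: int
    d_out: int
    bits: int  # quantization bit-width q of the frozen base weight
    rank: int  # LoRA rank r of the trainable adapter


# Assumption: adapter weights are kept in 16-bit (2 bytes per parameter).
ADAPTER_BYTES_PER_PARAM = 2


def layer_bytes(cfg: LayerConfig) -> float:
    """Approximate weight memory of one layer (base weight + LoRA factors)."""
    base = cfg.d_in * cfg.d_out * cfg.bits / 8
    # LoRA adds two factors of shapes (d_in x r) and (r x d_out).
    adapter = cfg.rank * (cfg.d_in + cfg.d_out) * ADAPTER_BYTES_PER_PARAM
    return base + adapter


def total_bytes(configs: list[LayerConfig]) -> float:
    return sum(layer_bytes(c) for c in configs)


def is_feasible(configs: list[LayerConfig], budget_bytes: float) -> bool:
    """A configuration is feasible if it fits within the memory budget."""
    return total_bytes(configs) <= budget_bytes


# Toy example with two 1024 x 1024 layers (illustrative sizes only).
uniform = [LayerConfig(1024, 1024, 4, 16), LayerConfig(1024, 1024, 4, 16)]
mixed = [LayerConfig(1024, 1024, 2, 64), LayerConfig(1024, 1024, 4, 16)]

budget = total_bytes(uniform)  # use the uniform 4-bit footprint as the budget
print(total_bytes(uniform), total_bytes(mixed), is_feasible(mixed, budget))
```

In this toy setting, the mixed configuration lowers the first layer to 2 bits and raises its rank to 64, yet still fits within the uniform 4-bit footprint. Many distinct pairings of this kind are feasible under one budget, which is the search space the joint optimization has to navigate.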

Paper Structure

This paper contains 34 sections, 29 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Empirical Motivation for Joint Optimization. (a) Impact of Joint Allocation: We visualize the accuracy distribution of feasible mixed-precision configurations across tasks. The substantial performance spread demonstrates that distinct pairings of bit-width ($q$) and rank ($r$) yield vastly different outcomes even under the same memory budget. (b) Proxy-Objective Mismatch: Standard calibration metrics (Perplexity, x-axis) fail to predict post-finetuning accuracy (y-axis). The weak correlation ($\rho{=}0.46$) and frequent rank reversals indicate that static proxies cannot reliably identify configurations where learnable adapters compensate for quantization noise.
  • Figure 2: Overview of the AutoQRA framework. Phase I (left) approximates the global Pareto frontier via a multi-fidelity evolutionary search, utilizing importance-guided mutations and surrogate screening to navigate the discrete space. Phase II (right) performs a local Bayesian refinement to identify a precise operating point that maximizes user utility under the budget constraint.
  • Figure 3: Layer-wise configurations found by AutoQRA show a compensation pattern. Layers assigned lower bit-widths are often paired with higher ranks, suggesting that adapter capacity is reallocated to compensate for quantization noise.
  • Figure 4: Surrogate Quality. Surrogate accuracy improves with paired data and boosts top-3 promotion hit rate.
  • Figure 5: Search efficiency analysis. (Left) Best validation performance versus the number of evaluations at the largest search budget $b_S$. AutoQRA improves rapidly and consistently outperforms random search. (Right) Number of largest-budget evaluations required to reach a target accuracy. AutoQRA needs 6 evaluations compared to 107 for random search, yielding an $18\times$ reduction in expensive evaluations.
  • ...and 5 more figures