AutoMixQ: Self-Adjusting Quantization for High Performance Memory-Efficient Fine-Tuning

Changhai Zhou, Shiyang Zhang, Yuhua Zhou, Zekai Liu, Shichao Weng

TL;DR

This work proposes AutoMixQ, an end-to-end optimization framework that selects an optimal quantization configuration for each LLM layer. It uses lightweight performance models to guide the selection, which significantly reduces time and computational cost compared with exhaustive search.

Abstract

Fine-tuning large language models (LLMs) under resource constraints is a significant challenge in deep learning. Low-Rank Adaptation (LoRA), pruning, and quantization are all effective methods for improving resource efficiency. However, combining them directly often results in suboptimal performance, especially with uniform quantization across all model layers. This is due to the complex, uneven interlayer relationships introduced by pruning, necessitating more refined quantization strategies. To address this, we propose AutoMixQ, an end-to-end optimization framework that selects optimal quantization configurations for each LLM layer. AutoMixQ leverages lightweight performance models to guide the selection process, significantly reducing time and computational resources compared to exhaustive search methods. By incorporating Pareto optimality, AutoMixQ balances memory usage and performance, approaching the upper bounds of model capability under strict resource constraints. Our experiments on widely used benchmarks show that AutoMixQ reduces memory consumption while achieving superior performance. For example, at a 30% pruning rate in LLaMA-7B, AutoMixQ achieved 66.21% on BoolQ compared to 62.45% for LoRA and 58.96% for LoftQ, while reducing memory consumption by 35.5% compared to LoRA and 27.5% compared to LoftQ.
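To make the contrast between uniform and per-layer quantization concrete, the following is a minimal sketch of how the weight memory of a per-layer bit-width assignment might be estimated. It is an illustration only, not the authors' implementation: the function name `weight_memory_bytes`, the layer sizes, and the candidate bit-widths (4 and 8) are all assumptions chosen for the example.

```python
from typing import Sequence


def weight_memory_bytes(params_per_layer: Sequence[int],
                        bits_per_layer: Sequence[int]) -> float:
    """Estimate weight storage (in bytes) for a per-layer bit-width assignment.

    Hypothetical helper: ignores quantization metadata such as scales and
    zero points, and ignores the LoRA adapter weights.
    """
    if len(params_per_layer) != len(bits_per_layer):
        raise ValueError("one bit-width is required per layer")
    return sum(p * b / 8 for p, b in zip(params_per_layer, bits_per_layer))


# Hypothetical 4-layer model with uneven layer sizes, as might result from pruning.
params = [1_000_000, 800_000, 600_000, 1_000_000]

uniform_4bit = [4, 4, 4, 4]   # same precision everywhere
mixed = [8, 4, 4, 8]          # higher precision for the first and last layer (assumed choice)

uniform_bytes = weight_memory_bytes(params, uniform_4bit)
mixed_bytes = weight_memory_bytes(params, mixed)
print(f"uniform 4-bit: {uniform_bytes:,.0f} bytes")
print(f"mixed 4/8-bit: {mixed_bytes:,.0f} bytes")
```

With L layers and two candidate bit-widths there are 2^L such assignments, which is why an exhaustive search that fine-tunes every configuration quickly becomes impractical.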

Paper Structure

This paper contains 23 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The workflow of AutoMixQ begins with a fine-tuning dataset, which is used to build a performance model. The Pareto frontier is established based on the model's predictions and known data, and a configuration that best fits the objective function is selected. This configuration is then fine-tuned on the LLM, and the fine-tuning results are used to update the performance model. This cycle of prediction, selection, evaluation, and updating continues until the Pareto frontier stabilizes or a predefined iteration limit is reached.
  • Figure 2: Pareto-front scatter plots for BoolQ and WinoGrande with 50 data points. The red points indicate the non-dominated configurations within the Pareto frontier.
  • Figure 3: Sample generation for LLaMA-7B and Vicuna-7B models with a 20% pruning rate.
  • Figure 4: Pareto-front scatter plots for different downstream tasks.
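
The predict, select, evaluate, and update cycle described in the Figure 1 caption, together with the non-dominated filtering shown in Figure 2, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's code: the nearest-neighbour surrogate in `predict`, the simulated `fine_tune_and_evaluate`, the scalar objective with `memory_weight`, and the 4/8-bit search space are all hypothetical stand-ins for the real performance model, fine-tuning run, and objective function.

```python
import itertools
import random

BITS = (4, 8)      # assumed candidate bit-widths per layer
NUM_LAYERS = 4     # toy model size


def memory_cost(config):
    """Toy memory proxy: total bits across layers (lower is better)."""
    return sum(config)


def fine_tune_and_evaluate(config):
    """Stand-in for a real fine-tuning run; returns a simulated accuracy."""
    sensitivity = (0.4, 0.1, 0.2, 0.3)  # hypothetical per-layer importance
    return sum(w * b / 8 for w, b in zip(sensitivity, config))


def predict(config, observed):
    """Toy surrogate model: score of the nearest evaluated configuration."""
    nearest = min(observed, key=lambda c: sum(a != b for a, b in zip(c, config)))
    return observed[nearest]


def pareto_front(points):
    """Return the keys whose (memory, score) pair is not dominated by any other."""
    front = set()
    for c, (m, s) in points.items():
        dominated = any(
            m2 <= m and s2 >= s and (m2 < m or s2 > s)
            for c2, (m2, s2) in points.items() if c2 != c
        )
        if not dominated:
            front.add(c)
    return front


def search(max_iters=10, seed=0, memory_weight=0.01):
    rng = random.Random(seed)
    candidates = list(itertools.product(BITS, repeat=NUM_LAYERS))
    # Initial dataset: a few configurations that were actually evaluated.
    observed = {c: fine_tune_and_evaluate(c) for c in rng.sample(candidates, 3)}
    prev_front = None
    for _ in range(max_iters):
        # Known scores where available, surrogate predictions elsewhere.
        points = {
            c: (memory_cost(c), observed[c] if c in observed else predict(c, observed))
            for c in candidates
        }
        front = pareto_front(points)
        if front == prev_front:          # frontier has stabilized
            break
        unevaluated = [c for c in front if c not in observed]
        if not unevaluated:              # nothing new left to test on the frontier
            break
        # Select the frontier configuration that best fits the (assumed) objective.
        best = max(unevaluated, key=lambda c: points[c][1] - memory_weight * points[c][0])
        observed[best] = fine_tune_and_evaluate(best)   # evaluate, then update the data
        prev_front = front
    measured = {c: (memory_cost(c), s) for c, s in observed.items()}
    return observed, pareto_front(measured)


observed, final_front = search()
for config in sorted(final_front):
    print(config, memory_cost(config), round(observed[config], 3))
```

In this sketch the expensive step is `fine_tune_and_evaluate`, which runs at most once per iteration, while the surrogate scores every candidate cheaply. That is the intuition behind using a lightweight performance model instead of exhaustive search.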