DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models
Wenjin Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum
TL;DR
DL-QAT tackles the high cost of quantization-aware training for large language models by decomposing weight updates into group-specific magnitude training and LoRA-based fine-tuning within a predefined quantization space, so that less than 1% of the parameters are trained. By combining group-wise magnitude scaling with low-rank updates, it achieves competitive accuracy at low-bit quantization while substantially reducing training memory and time. Evaluations on LLaMA and LLaMA2 show consistent gains over QA-LoRA and LLM-QAT across multiple benchmarks, including MMLU and LM-Eval, as well as improved perplexity on WikiText-2. The approach offers a practical, resource-efficient path to deploying quantized LLMs in constrained environments.
Abstract
Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often struggles at low bit widths, particularly on downstream tasks. Quantization-aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To address this, we introduce Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which retains the advantages of QAT while training less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight magnitude and direction in the quantization space. We validate the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, on the 3-bit LLaMA-7B model, our approach outperforms the previous state-of-the-art method by 4.2% on MMLU. Additionally, our quantization results on pre-trained models surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.
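To make the decomposition described in the abstract more concrete, the following is a minimal NumPy sketch of one possible forward computation. It is not the authors' implementation. The abstract does not give the exact formulation, so the symmetric round-to-nearest quantizer, the max-based per-group scale, the grouping along the input dimension, and all names (`fake_quantize_groups`, `dl_qat_style_weight`, `lora_a`, `lora_b`, `magnitude`) are assumptions made for illustration only.

```python
import numpy as np


def fake_quantize_groups(w, n_bits, group_size):
    """Assumed quantizer: symmetric round-to-nearest per group of input channels."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    groups = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    # One scale per (row, group), derived from the largest absolute value in the group.
    scale = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return q, scale


def dl_qat_style_weight(w0, lora_a, lora_b, magnitude, n_bits=3, group_size=4):
    """Hypothetical forward pass: low-rank update, group-wise quantization, magnitude.

    w0:        frozen pre-trained weight, shape (out_features, in_features)
    lora_a:    trainable low-rank factor, shape (rank, in_features)
    lora_b:    trainable low-rank factor, shape (out_features, rank)
    magnitude: trainable per-group magnitude, shape (out_features, n_groups)
    """
    w = w0 + lora_b @ lora_a                     # LoRA update applied before quantization
    q, scale = fake_quantize_groups(w, n_bits, group_size)
    w_hat = q * scale * magnitude[..., None]     # rescale each quantization group
    return w_hat.reshape(w0.shape), q


rng = np.random.default_rng(0)
w0 = rng.normal(size=(8, 16))
lora_a = rng.normal(size=(2, 16))
lora_b = np.zeros((8, 2))      # zero init: the low-rank update starts as a no-op
magnitude = np.ones((8, 4))    # 16 input channels / group_size 4 = 4 groups per row
w_hat, q = dl_qat_style_weight(w0, lora_a, lora_b, magnitude)
print(w_hat.shape, q.min(), q.max())
```

In this sketch only `lora_a`, `lora_b`, and `magnitude` would be trained while `w0` stays frozen, which mirrors the small trainable-parameter budget described above. A real training loop would also need a gradient path through the rounding step (for example a straight-through estimator), which is omitted here.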
