Can Post-Training Quantization Benefit from an Additional QLoRA Integration?

Xiliang Zhu, Elena Khasanova, Cheng Chen

TL;DR

The paper tackles the challenge of deploying large language models in resource-constrained environments by combining 4-bit Post-training Quantization (PTQ) with QLoRA, aiming to preserve generation quality while reducing memory and latency. The proposed PTQ-QLoRA pipeline begins with 16-bit supervised fine-tuning, applies 4-bit PTQ, and completes a final fine-tuning pass using QLoRA, evaluated across three decoder-only base models and two quantization backends. Results indicate that PTQ-QLoRA often matches or surpasses 16-bit full fine-tuning and outperforms standard PTQ, with robust performance across internal and external task datasets and minimal loss in factual consistency. This approach offers a practical, scalable pathway for deploying powerful LLMs in production settings with limited computational resources.
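To make the memory argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' code. It estimates weight storage at 16-bit versus 4-bit precision and the number of trainable parameters a low-rank adapter adds to one weight matrix. The 7-billion-parameter model size, the 4096 x 4096 layer shape, and the rank of 16 are hypothetical values chosen for illustration and do not come from the paper. Real 4-bit formats also store scales and zero-points, so actual savings are somewhat smaller than this estimate.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in GiB.

    Ignores quantization metadata (scales, zero-points), activations,
    and optimizer state, so it is a lower bound on real memory use.
    """
    return num_params * bits_per_weight / 8 / 2**30


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one low-rank adapter pair.

    A LoRA-style update factors the weight delta into two matrices of
    shapes (rank x d_in) and (d_out x rank).
    """
    return rank * (d_in + d_out)


# Hypothetical values for illustration only (not taken from the paper).
HYPOTHETICAL_PARAMS = 7_000_000_000
HYPOTHETICAL_D = 4096
HYPOTHETICAL_RANK = 16

fp16_gib = weight_memory_gib(HYPOTHETICAL_PARAMS, 16)
int4_gib = weight_memory_gib(HYPOTHETICAL_PARAMS, 4)
adapter = lora_trainable_params(HYPOTHETICAL_D, HYPOTHETICAL_D, HYPOTHETICAL_RANK)
full = HYPOTHETICAL_D * HYPOTHETICAL_D

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f"4-bit weights:  {int4_gib:.2f} GiB")
print(f"adapter params per layer: {adapter} of {full} in the full matrix")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit storage, and the adapter trains well under one percent of the parameters in the matrix it modifies, which is why a final QLoRA pass after PTQ is cheap compared with the initial 16-bit fine-tuning.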

Abstract

Large language models (LLMs) have transformed natural language processing but pose significant challenges for real-world deployment. These models necessitate considerable computing resources, which can be costly and frequently unavailable. Model compression techniques such as quantization are often leveraged to alleviate resource demand, but they may have a negative impact on the generation quality. In this study, we explore the integration of 4-bit Post-training Quantization (PTQ) with QLoRA to address these issues. We demonstrate through extensive experiments that this integration outperforms standard PTQ, and in some cases even 16-bit full-parameter fine-tuning on LLMs, validated across proprietary and public datasets with different quantization algorithms. The results demonstrate the efficacy of PTQ-QLoRA integration, offering a viable solution for deploying powerful LLMs in resource-constrained environments without compromising on performance.

Paper Structure

This paper contains 20 sections, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Diagram of the PTQ-QLoRA integration. Note that the same fine-tuning datasets are used twice: once during full-parameter SFT and once during QLoRA fine-tuning.