QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim

TL;DR

QWHA introduces a quantization-aware, Walsh-Hadamard transform-based adapter for parameter-efficient fine-tuning of large language models. By formulating weight updates as $\Delta W = F H^{-1}$ with a fixed WHT kernel and a sparsified, adaptively initialized coefficient matrix, it achieves high representational capacity while mitigating quantization errors. The AdaAlloc initialization and subsequent value refinement ensure full-rank parameter allocations and precise error reconstruction, leading to substantial accuracy and training-speed improvements, especially at ultra-low bit-widths. The method demonstrates strong gains across multiple models and tasks with efficient computation and memory use, making QA-PEFT more practical for deployment of quantized LLMs.
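The formulation $\Delta W = F H^{-1}$ can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`hadamard`, `wht_adapter_update`), the toy dimensions, the random choice of nonzero positions, and the use of an unnormalized Sylvester-ordered Hadamard matrix (so that $H^{-1} = H^{\top}/n$) are all assumptions made for illustration.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def wht_adapter_update(values, rows, cols, d_out, d_in):
    """Scatter trainable values into a sparse F and return (F @ H^{-1}, F, H)."""
    F = np.zeros((d_out, d_in))
    F[rows, cols] = values           # only these entries would be trained
    H = hadamard(d_in)               # fixed kernel, never updated
    H_inv = H.T / d_in               # holds because H @ H.T = d_in * I
    return F @ H_inv, F, H

# Toy example: 16 coefficients at random positions (an assumption; the
# paper selects positions with AdaAlloc, which is not reproduced here).
rng = np.random.default_rng(0)
d_out, d_in, n_params = 8, 8, 16
flat = rng.choice(d_out * d_in, size=n_params, replace=False)
rows, cols = np.unravel_index(flat, (d_out, d_in))
values = rng.standard_normal(n_params)

delta_W, F, H = wht_adapter_update(values, rows, cols, d_out, d_in)
```

Because $H$ is invertible, multiplying `delta_W` by `H` recovers `F` exactly, and `delta_W` has the same rank as `F`. A practical implementation would likely apply a fast Walsh-Hadamard transform rather than materializing `H`.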

Abstract

The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
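The capacity argument can be made concrete with a small numerical comparison. The sketch below is only an illustration under stated assumptions: a toy 64 x 64 layer, a rank-4 low-rank adapter, and a sparse transform-domain adapter with the same parameter budget whose nonzero positions are drawn uniformly at random (not the paper's selection scheme). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
budget = 2 * d * r                    # parameter count of the low-rank pair A, B

# Low-rank adapter: delta_W = A @ B has rank at most r.
A = rng.standard_normal((d, r))
B = rng.standard_normal((r, d))
delta_lowrank = A @ B

# Sparse transform-domain adapter with the same budget: delta_W = F @ H^{-1}.
H = np.array([[1.0]])
while H.shape[0] < d:                 # Sylvester construction of the Hadamard matrix
    H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
F = np.zeros((d, d))
idx = rng.choice(d * d, size=budget, replace=False)
F.flat[idx] = rng.standard_normal(budget)
delta_wht = F @ (H.T / d)             # H^{-1} = H.T / d for the unnormalized H

rank_lowrank = np.linalg.matrix_rank(delta_lowrank)
rank_wht = np.linalg.matrix_rank(delta_wht)
print(rank_lowrank, rank_wht)
```

With equal parameter counts, the low-rank update is capped at rank `r`, whereas the rank of the transform-domain update is limited only by how the nonzero coefficients are spread across rows and columns.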

Paper Structure

This paper contains 59 sections, 30 equations, 10 figures, 20 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of Quantization-aware Walsh-Hadamard Adaptation (QWHA). The weight update from QWHA is formulated as $\Delta {\bm{W}} = {\bm{F}} {\bm{H}}^{-1}$, where ${\bm{H}}$ is a predefined Walsh-Hadamard transform (WHT) matrix and ${\bm{F}}$ is a trainable sparse coefficient matrix consisting of values ${\bm{c}}$ and their indices ${\bm{E}}$. The multiplication ${\bm{F}} {\bm{H}}^{-1}$ expands the learned coefficients (i.e., ${\bm{c}}$) over the transform basis (i.e., the columns of ${\bm{H}}^{-1}$). Note that the coefficients ${\bm{c}}$ are the only trainable parameters, and ${\bm{H}}$ remains constant. Our key contributions are the adoption of WHT in the adapter (WHA) and the initialization of the adapter, particularly of ${\bm{E}}$ (AdaAlloc) and ${\bm{c}}$ (Refinement).
  • Figure 2: (a) Comparison of rank in weight updates between low-rank and FT-based adapters across linear layers. (b) Cumulative distribution of $\ell_2$ norm of singular values and transform coefficients with Pareto hill index $\eta$ for the quantization error $\Delta{\bm{W}}_Q$ in the 14th-layer Value projection. The vertical blue line indicates a point where the adapters have the same number of parameters.
  • Figure 3: (a) Average coverage of outlier components within the selected parameters. (b) $\ell_2$ norm of the layer output error after initialization on the 14th-layer Key projection. The vertical blue lines indicate points where the adapters have the same number of parameters.
  • Figure 4: Rank of adapter weights for each parameter selection method.
  • Figure 5: Effect of refinement on average layer output error.
  • ...and 5 more figures