HLQ: Fast and Efficient Backpropagation via Hadamard Low-rank Quantization
Seonggon Kim, Eunhyeok Park
TL;DR
HLQ reduces the cost of backpropagation when training large models by applying Hadamard quantization (HQ) to activation gradients and Hadamard low-rank approximation to weight gradients, while preserving forward-pass accuracy. The method treats the two gradient computations, $g_w = \frac{1}{B} \bar{g}_y^T \cdot \bar{x}$ and $g_x = g_y \cdot w$, differently: it uses 4-bit HQ for $g_x$ and low-rank, int8-accelerated processing for $g_w$, together with an int4 activation-compression strategy (ACBP). Empirically, HLQ achieves up to $2.5\times$ faster backpropagation and up to $78.5\%$ memory reduction while maintaining competitive or superior accuracy relative to baselines such as LBP-WHT and LUQ, across CNNs and ViTs, in both training-from-scratch and fine-tuning settings. The work offers practical training-cost reductions for resource-constrained environments and suggests future extensions to large language models.
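To make the weight-gradient formula concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a linear layer $y = x w^T$ with $x \in \mathbb{R}^{B \times I}$ and $w \in \mathbb{R}^{O \times I}$, and it assumes that $\bar{g}_y$ and $\bar{x}$ denote $g_y$ and $x$ transformed along the batch dimension by an unnormalized Hadamard matrix $H$ of order $B$, so that the $\frac{1}{B}$ factor undoes $H^T H = B I$. The helper names `hadamard` and `weight_grad_hadamard` and the choice of keeping the first `rank` rows of the Sylvester-ordered matrix are illustrative assumptions; the int8 arithmetic mentioned above is omitted.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix with entries +/-1."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h


def weight_grad_hadamard(g_y: np.ndarray, x: np.ndarray, rank=None) -> np.ndarray:
    """Compute g_w = (1/B) * g_y_bar^T @ x_bar for a linear layer.

    g_y: (B, O) output gradient, x: (B, I) input activation.
    With rank=None all B Hadamard rows are used and the result equals g_y^T @ x.
    With rank=r only the first r rows are kept (an assumed selection rule),
    which shrinks the inner dimension of the matmul from B to r.
    """
    batch = g_y.shape[0]
    h = hadamard(batch)
    if rank is not None:
        h = h[:rank]
    g_y_bar = h @ g_y  # (r, O): gradient projected along the batch axis
    x_bar = h @ x      # (r, I): activation projected along the batch axis
    return g_y_bar.T @ x_bar / batch


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g_y = rng.standard_normal((16, 4))
    x = rng.standard_normal((16, 6))
    exact = g_y.T @ x
    approx = weight_grad_hadamard(g_y, x, rank=4)
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"relative error at rank 4: {rel_err:.3f}")
```

With the full transform the result matches $g_y^T x$ up to floating-point error; truncating the rows trades accuracy for a smaller matrix product, and the error depends on how much of the signal the retained rows capture.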
Abstract
With the rapid increase in model size and the growing importance of various fine-tuning applications, lightweight training has become crucial. Since the backward pass is roughly twice as expensive as the forward pass, optimizing backpropagation is particularly important. However, modifications to this process can lead to suboptimal convergence, so training optimization should minimize perturbations, which is highly challenging. In this study, we introduce a novel optimization strategy called Hadamard Low-rank Quantization (HLQ), which focuses on reducing the cost of backpropagation in convolutional and linear layers. We first analyze the sensitivity of gradient computation with respect to activations and weights, and then design the HLQ pipeline to apply 4-bit Hadamard quantization to the activation gradient and Hadamard low-rank approximation to the weight gradient. We found this combination to yield the greatest benefit, and our extensive experiments demonstrate the strong performance of HLQ in both training from scratch and fine-tuning, achieving significant memory savings and acceleration on real GPUs with negligible quality degradation.
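The 4-bit Hadamard quantization of the activation gradient can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions, not the authors' kernel: the Hadamard transform is applied along the output-channel dimension $O$ shared by $g_y \in \mathbb{R}^{B \times O}$ and $w \in \mathbb{R}^{O \times I}$, both transformed operands are quantized with a symmetric per-tensor scale and round-to-nearest, and the int4 codes are stored in `int8` arrays. The function names `quantize_int4` and `act_grad_hq` are hypothetical.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix with entries +/-1."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h


def quantize_int4(t: np.ndarray):
    """Symmetric per-tensor quantization to integer codes in [-7, 7].

    Returns (codes, scale) such that codes * scale approximates t.
    """
    max_abs = float(np.abs(t).max())
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(t / scale), -7, 7).astype(np.int8)
    return codes, scale


def act_grad_hq(g_y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Approximate g_x = g_y @ w using Hadamard-transformed 4-bit operands.

    Uses the identity g_y @ w = (g_y @ H^T) @ (H @ w) / O, which holds because
    H^T @ H = O * I for an unnormalized Hadamard matrix of order O.
    """
    out_dim = w.shape[0]
    h = hadamard(out_dim)
    q_g, s_g = quantize_int4(g_y @ h.T)  # (B, O) integer codes
    q_w, s_w = quantize_int4(h @ w)      # (O, I) integer codes
    acc = q_g.astype(np.int32) @ q_w.astype(np.int32)  # integer accumulation
    return acc * (s_g * s_w / out_dim)   # rescale back to floating point


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g_y = rng.standard_normal((32, 64))
    w = rng.standard_normal((64, 16))
    exact = g_y @ w
    approx = act_grad_hq(g_y, w)
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"relative error of 4-bit HQ: {rel_err:.3f}")
```

The transform spreads each value across all channels before quantization, which tends to flatten outliers that would otherwise dominate a per-tensor scale; the integer matrix product is the part that low-precision hardware would accelerate.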
