HOT: Hadamard-based Optimized Training
Seonggon Kim, Juncheol Shin, Seung-taek Woo, Eunhyeok Park
TL;DR
HOT is a Hadamard-based framework that reduces the memory and compute cost of backpropagation by applying Hadamard quantization to the activation-gradient path and Hadamard low-rank approximation to the weight-gradient path, complemented by activation buffer compression and layer-wise quantizer selection. By tailoring each optimization to the distinct properties of the two backward paths, HOT achieves up to 75% memory savings and a 2.6× speedup on real GPUs with negligible accuracy loss relative to FP32 training, and it is compatible with LoRA for further efficiency. The approach is validated on vision and language tasks, covering both fine-tuning and pre-training, and remains robust where prior methods degrade or fail. This makes it a practical route to training and fine-tuning large foundation models on limited hardware.
Abstract
It has become increasingly important to optimize backpropagation to reduce memory usage and computational overhead. Achieving this goal is highly challenging, as multiple objectives must be considered jointly while maintaining training quality. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail to identify lightweight techniques that offer the best benefits. Based on this analysis, we introduce a novel method, Hadamard-based Optimized Training (HOT). In this approach, we apply Hadamard-based optimizations, such as Hadamard quantization and Hadamard low-rank approximation, selectively and with awareness of the suitability of each optimization for different backward paths. Additionally, we introduce two enhancements: activation buffer compression and layer-wise quantizer selection. Our extensive analysis shows that HOT achieves up to 75% memory savings and a 2.6 times acceleration on real GPUs, with negligible accuracy loss compared to FP32 precision.
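To make the two backward paths concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a linear layer y = x @ w.T, a Sylvester-ordered orthonormal Hadamard matrix, and a simple symmetric per-tensor quantizer. The helper names (`hadamard`, `fake_quant`, `hq_matmul`, `hla_matmul`), the bit-width, and the rank are hypothetical choices made for illustration. Which Hadamard components are kept and how the quantizers are configured in HOT itself are not specified in the abstract.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def fake_quant(x, bits=4):
    """Symmetric per-tensor quantize-dequantize (an illustrative quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    if scale == 0:
        return x
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def hq_matmul(a, b, bits=4):
    """Approximate a @ b: rotate the shared dimension with H, then quantize.

    Since H @ H.T = I, (a @ H) @ (H.T @ b) equals a @ b before quantization.
    """
    h = hadamard(a.shape[1])
    return fake_quant(a @ h, bits) @ fake_quant(h.T @ b, bits)

def hla_matmul(a, b, rank):
    """Approximate a @ b using only `rank` Hadamard components of the shared dimension.

    Assumption: the first `rank` rows are kept; other selections are possible.
    """
    h_r = hadamard(a.shape[1])[:rank]      # shape (rank, k)
    return (a @ h_r.T) @ (h_r @ b)         # inner dimension shrinks from k to rank

# Toy backward pass for y = x @ w.T
rng = np.random.default_rng(0)
tokens, d_in, d_out = 16, 4, 8
x = rng.standard_normal((tokens, d_in))
w = rng.standard_normal((d_out, d_in))
g_y = rng.standard_normal((tokens, d_out))  # gradient w.r.t. the layer output

g_x = hq_matmul(g_y, w, bits=4)             # activation gradient: g_y @ w
g_w = hla_matmul(g_y.T, x, rank=4)          # weight gradient: g_y.T @ x
```

In this sketch the activation-gradient product shares the output-feature dimension, while the weight-gradient product shares the token dimension, which is why the two paths can be treated differently. With `rank` equal to the full shared dimension, `hla_matmul` reproduces the exact product, so the rank directly controls the trade-off between compute and approximation error.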
