HOT: Hadamard-based Optimized Training

Seonggon Kim, Juncheol Shin, Seung-taek Woo, Eunhyeok Park

TL;DR

HOT is a Hadamard-based framework that reduces the memory and compute cost of backpropagation by applying Hadamard Quantization to the activation-gradient path and Hadamard Low-rank Approximation to the weight-gradient path, complemented by Activation Buffer Compression and Layer-wise Quantizer Selection. By matching each optimization to the properties of its gradient path, HOT achieves up to 75% memory savings and a 2.6× speedup on real GPUs with negligible accuracy loss compared to FP32, and it is compatible with LoRA for further efficiency. The approach is validated on vision and language tasks, covering both fine-tuning and pre-training, and remains robust where prior methods degrade or fail, which makes it a practical route to memory-efficient training and fine-tuning of large models on limited hardware.

Abstract

It has become increasingly important to optimize backpropagation to reduce memory usage and computational overhead. Achieving this goal is highly challenging, as multiple objectives must be considered jointly while maintaining training quality. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail to identify lightweight techniques that offer the best benefits. Based on this analysis, we introduce a novel method, Hadamard-based Optimized Training (HOT). In this approach, we apply Hadamard-based optimizations, such as Hadamard quantization and Hadamard low-rank approximation, selectively and with awareness of the suitability of each optimization for different backward paths. Additionally, we introduce two enhancements: activation buffer compression and layer-wise quantizer selection. Our extensive analysis shows that HOT achieves up to 75% memory savings and a 2.6 times acceleration on real GPUs, with negligible accuracy loss compared to FP32 precision.
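To make the idea of Hadamard quantization concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the helper names (`fwht`, `quantize`, `dequantize`, `mse`), the symmetric per-tensor INT4 quantizer, and the toy vector are all assumptions chosen for illustration. It shows only the general effect that an orthonormal Hadamard transform spreads an outlier across all entries, so a low-bit quantizer loses less of the small values.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly stage: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)  # scaling by 1/sqrt(n) makes the transform its own inverse
    return [v / norm for v in out]


def quantize(vec, bits):
    """Symmetric per-tensor quantization (an assumed, simple quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in vec)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in vec]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy vector: one large outlier among small entries (hypothetical data).
x = [8.0, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5]

# Quantize directly to INT4: the outlier sets the scale and the small entries round to zero.
direct = dequantize(*quantize(x, bits=4))

# Transform, quantize, dequantize, then transform back.
via_hadamard = fwht(dequantize(*quantize(fwht(x), bits=4)))

err_direct = mse(x, direct)
err_hadamard = mse(x, via_hadamard)
print(f"direct INT4 MSE:   {err_direct:.5f}")
print(f"Hadamard INT4 MSE: {err_hadamard:.5f}")
```

Because the normalized transform is orthonormal, it also preserves dot products, which is why it can be applied along the shared dimension of a matrix multiplication without changing the exact result; only the quantization error differs.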

Paper Structure

This paper contains 37 sections, 8 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Memory requirements for training ViT-B on the ImageNet-1k dataset with varying batch sizes. While FP and other efficient BP methods fail to train with batch sizes of 256 and above on a single GPU with 24 GB of memory, HOT enables training with batch sizes up to 1024.
  • Figure 2: Component-wise breakdown of memory consumption for different methods when training ViT-B on the ImageNet-1k dataset with a batch size of 256.
  • Figure 3: Illustration of (a) internal HLA and (b) external HLA. Internal HLA reduces dimension $N$ to $r$, while external HLA compresses $M$ to $r$. The operator $\odot$ represents matrix multiplication.
  • Figure 4: Layer-wise MSE analysis for ViT-B (top) and ResNet-50 (bottom). The graphs show higher errors for HT+INT4 on the weight gradient ($g_w$) path and accumulated errors for HLA on the activation gradient ($g_x$) path.
  • Figure 5: The pipelines for (a) standard BP, (b) LBP-WHT, and (c) HOT. HOT reduces memory consumption by compressing the activations needed for BP with HLA and INT8 quantization and storing them in CTX (shown in red in (c)). It accelerates gradient computation through integer matrix multiplication (the INT8 and INT4 rectangular sections in (c)).
  • ...and 4 more figures
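
The distinction between internal and external HLA in Figure 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the product is taken to be `a @ b` with `a` of shape M×N and `b` of shape N×K, the function names are hypothetical, and keeping the first `r` rows of the Hadamard matrix in natural order is a simplification (the row-selection rule used by HOT is not given in this section).

```python
import math


def hadamard_rows(n, r):
    """First r rows of the orthonormal n x n Sylvester-Hadamard matrix (natural order)."""
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    norm = math.sqrt(n)
    return [[(-1) ** bin(i & j).count("1") / norm for j in range(n)] for i in range(r)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def internal_hla(a, b, r):
    """Approximate a @ b by shrinking the shared dimension N to r."""
    h = hadamard_rows(len(b), r)          # r x N
    a_small = matmul(a, transpose(h))     # M x r
    b_small = matmul(h, b)                # r x K
    return matmul(a_small, b_small)       # M x K


def external_hla(a, b, r):
    """Approximate a @ b by shrinking the outer dimension M to r."""
    h = hadamard_rows(len(a), r)          # r x M
    a_small = matmul(h, a)                # r x N
    c_small = matmul(a_small, b)          # r x K
    return matmul(transpose(h), c_small)  # M x K, expanded back


# Hypothetical toy data: M = 2, N = 4, K = 2.
a = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
b = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]

exact = matmul(a, b)
approx_internal = internal_hla(a, b, r=1)
approx_external = external_hla(a, b, r=1)
print("exact:   ", exact)
print("internal:", approx_internal)
print("external:", approx_external)
```

In this toy case each row of `a` is constant along N, so a single Hadamard row already captures it and the internal approximation with `r=1` matches the exact product. The external approximation with `r=1` instead averages over the M rows, which shows that the two variants discard different information.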