Activation Compression of Graph Neural Networks using Block-wise Quantization with Improved Variance Minimization

Sebastian Eliassen, Raghavendra Selvan

TL;DR

The paper tackles memory bottlenecks in training graph neural networks by extending EXACT in two ways: block-wise quantization of activations to INT2, and an improved variance minimization that models activations with a clipped-normal distribution. Reshaping the projected activations into blocks and quantizing within each block reduces memory usage further than EXACT while preserving accuracy and delivering modest per-epoch speedups. The authors show that activation maps in GNNs are better described by a clipped-normal distribution than by a uniform one, which enables non-uniform quantization boundaries that further reduce quantization variance, though this variance reduction does not translate into noticeable accuracy gains. Overall, the technique yields substantial memory savings (>95% vs. FP32 and >15% over EXACT) and a per-epoch speedup of about 5%, making block-wise INT2 quantization a practical option for memory-efficient GNN training.
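
The block-wise idea summarized above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `quantize_blockwise` and `dequantize_blockwise`, the flat-list input, and the per-block min/range storage are assumptions made for illustration. Each block is scaled to $[0, 2^b - 1]$, stochastically rounded to an integer code, and later mapped back using the block's own minimum and range.

```python
import math
import random


def quantize_blockwise(values, block_size, bits=2, rng=random):
    """Quantize a flat list of floats block by block (hypothetical helper).

    Each block keeps its own minimum and range, so only one (min, range)
    pair is stored per block alongside the low-bit integer codes.
    """
    levels = 2 ** bits - 1  # largest integer code, 3 for INT2
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        lo, hi = min(block), max(block)
        scale = (hi - lo) or 1.0  # constant block: avoid division by zero
        codes = []
        for v in block:
            x = (v - lo) / scale * levels  # map value into [0, levels]
            floor_x = math.floor(x)
            # Stochastic rounding: round up with probability equal to
            # the fractional part, so the result is unbiased.
            code = floor_x + (1 if rng.random() < x - floor_x else 0)
            codes.append(min(code, levels))
        blocks.append((lo, scale, codes))
    return blocks


def dequantize_blockwise(blocks, bits=2):
    """Map integer codes back to floats using each block's min and range."""
    levels = 2 ** bits - 1
    out = []
    for lo, scale, codes in blocks:
        out.extend(lo + c / levels * scale for c in codes)
    return out
```

In this sketch the reconstruction error of every value is bounded by one bin width of its own block, so a block with a narrow value range is reconstructed more finely than one with a wide range.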

Abstract

Efficient training of large-scale graph neural networks (GNNs) has been studied with a specific focus on reducing their memory consumption. Work by Liu et al. (2022) proposed extreme activation compression (EXACT), which demonstrated a drastic reduction in memory consumption by quantizing the intermediate activation maps down to INT2 precision. They showed little to no reduction in performance while achieving large reductions in GPU memory consumption. In this work, we present an improvement to the EXACT strategy by using block-wise quantization of the intermediate activation maps. We experimentally analyze different block sizes and show a further reduction in memory consumption (>15%) and a runtime speedup per epoch (about 5%), even under extreme quantization, with performance trade-offs similar to those of the original EXACT. Further, we present a correction to the assumptions on the distribution of intermediate activation maps in EXACT (assumed to be uniform) and show improved variance estimations of the quantization and dequantization steps.

Paper Structure (11 sections, 20 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Demonstration of stochastic rounding for $b=2$, i.e., $2^b=4$ quantization bins, for 128 uniformly sampled data points. Each sampled point can be quantized to any of the four levels. The closer the color of a sample is to the color of a vertical bar, the larger the probability that it is quantized to that bar. Quantization bins are shown with uniform bin widths (left) and with the non-uniform bin widths obtained from the variance optimization introduced in Sec. \ref{sec:variance} (right).
  • Figure 2: The observed normalized activations in a GNN model on the OGB-Arxiv data (left) compared to different modelled distributions: uniform (center) and clipped normal (right). Notice that the clipped normal models the observed distribution more accurately, including the edges, where the spikes are caused by clipping at the boundaries.
  • Figure 3: Variance of SR for INT2 quantization with different quantization boundaries $[\alpha,\beta]$ based on Eq. \ref{eq:sr_var}. Uniform bin widths are obtained when $\alpha=1.0$ and $\beta=2.0$.
  • Figure 4: This plot demonstrates the relative variance reduction across different layers ${\overline{{\mathbf H}}^{(\ell)}_\text{\tt proj}}$ and a clipnormal distribution model when performing variance minimization. Here the crosses indicate what we would expect the optimal dimensionality parameter for variance minimization to be, and the circles indicate what it actually is.
  • Figure 5: This plot illustrates the relative reduction in variance achieved through optimizing the assumed dimensionality in a clipnorm distribution, as indicated by different '$D\#$' hyperparameters. Each line represents the mean variance reduction across multiple trials for a given dimension, with shaded areas showing the range from minimum to maximum variance reduction observed. The markers on the line show the observed and expected maxima, highlighting the performance and consistency of the optimization across dimensions. The horizontal solid lines indicate the spread of observed maxima, providing insights into the variability and stability of the optimization process for each hyperparameter.
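
The stochastic rounding shown in Figure 1 and the effect of the boundaries $[\alpha,\beta]$ in Figure 3 can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code: the helper names `stochastic_round` and `sr_variance` are hypothetical, and the per-point variance used here is the standard property of unbiased two-point rounding rather than the paper's Eq. `eq:sr_var`, which is not reproduced in this section. A value $x$ between adjacent levels $l_i \le x \le l_{i+1}$ is rounded up with probability $(x - l_i)/(l_{i+1} - l_i)$, which gives an unbiased result with variance $(x - l_i)(l_{i+1} - x)$.

```python
import bisect
import random


def stochastic_round(x, levels, rng=random):
    """Round x to one of the sorted `levels`, unbiased in expectation."""
    if x <= levels[0]:
        return levels[0]
    if x >= levels[-1]:
        return levels[-1]
    i = bisect.bisect_right(levels, x) - 1  # levels[i] <= x < levels[i + 1]
    lo, hi = levels[i], levels[i + 1]
    p_up = (x - lo) / (hi - lo)  # nearer to hi means more likely to round up
    return hi if rng.random() < p_up else lo


def sr_variance(x, levels):
    """Variance of the rounded value for one x inside [levels[0], levels[-1]]."""
    i = bisect.bisect_right(levels, x) - 1
    i = max(0, min(i, len(levels) - 2))  # keep the bin index valid at the edges
    lo, hi = levels[i], levels[i + 1]
    return (x - lo) * (hi - x)


# INT2 levels with uniform bin widths (alpha=1.0, beta=2.0) ...
uniform_levels = [0.0, 1.0, 2.0, 3.0]
# ... and with arbitrary non-uniform boundaries chosen only for illustration.
nonuniform_levels = [0.0, 1.2, 1.8, 3.0]
```

With these example levels, a value near the center of the range (for instance 1.5) has lower rounding variance under the narrower middle bin, at the cost of higher variance in the wider outer bins. Which trade-off wins overall depends on how the activations are distributed, which is why the assumed distribution (uniform versus clipped normal) matters for choosing the boundaries.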