GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

Selim An, Il hong Suh, Yeseong Kim

Abstract

Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as standard methods for deploying large language models, but they often degrade accuracy at low-bit representations, e.g., 4 bits. Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit. GlowQ computes the high-precision projection once per input-sharing group and reuses it across the group's modules, reducing parameter and memory overhead while retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. The selective variant GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average.
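To make the group-shared idea concrete, the following is a minimal sketch (not the authors' released implementation) of the core mechanism the abstract describes: the quantization-error matrices of modules that share the same input (e.g., Q, K, V) are stacked, a single shared right factor is estimated from the stack, and each module keeps only its own small left factor, so the shared projection is computed once per group at inference. All names here (group_shared_lowrank, B_shared, etc.) are illustrative assumptions.

```python
import numpy as np

def group_shared_lowrank(weights, quantized, rank):
    """Sketch of a group-shared low-rank correction.

    weights, quantized: lists of (out_i, d) matrices for modules that
    share the same input x (an "input-sharing group", e.g., Q/K/V).
    Returns one shared right factor B_shared (rank, d) and per-module
    left factors A_i (out_i, rank) so that
        W_i @ x  ~=  Wq_i @ x + A_i @ (B_shared @ x).
    """
    # Stack the per-module quantization errors along the output dimension.
    errors = [w - wq for w, wq in zip(weights, quantized)]
    E_cat = np.vstack(errors)                    # (sum(out_i), d)

    # Shared right factor: top-r right singular vectors of the stack,
    # cached once per input-sharing group.
    _, _, Vt = np.linalg.svd(E_cat, full_matrices=False)
    B_shared = Vt[:rank]                         # (rank, d)

    # Per-module left factors: project each module's error onto the
    # shared basis, preserving layer-specific expressivity.
    A = [e @ B_shared.T for e in errors]         # (out_i, rank) each
    return A, B_shared

# At inference, the shared projection p = B_shared @ x is computed once
# per group and reused:  y_i = Wq_i @ x + A[i] @ p
```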

Paper Structure

This paper contains 84 sections, 66 equations, 13 figures, 21 tables, and 1 algorithm.

Figures (13)

  • Figure 1: GlowQ Overview
  • Figure 2: Input spectrum and energy-capture measurements. (a) We stream calibration samples through the model, collect the input activations at the target layer, and plot the eigenvalue spectrum of the empirical input covariance for the QKV and MLP groups, revealing a heavy-tailed profile. (b) The same spectra plotted in $\log_{10}\lambda_r$--$\log_{10} r$ coordinates; dotted lines show least-squares fits over the approximately linear tail region, indicating power-law decay $\lambda_r \propto r^{-\alpha}$ with exponents $\alpha_{\text{MLP}} \approx 0.77$ and $\alpha_{\text{QKV}} \approx 1.19$. (c, d) For each group, we vertically stack the quantization-error matrices and plot the cumulative fraction of Frobenius energy recovered by the best rank-$r$ approximation, for both the unweighted baseline (No cov) and the covariance-aligned variant that weights errors by the observed inputs (Cov align). Horizontal dashed lines mark 90% and 95% energy capture. (A sketch of this measurement appears after this list.)
  • Figure 3: Perplexity (PPL) and time-to-first-byte (TTFB) versus the fraction of restored groups.
  • Figure 4: Comparison of memory and performance trade-off. (a) Memory overhead of different methods. (b) PPL at equal memory budget.
  • Figure 5: Whitening vs. non-whitening alignment matrices. For LLaMA 3.2-3B, we estimate a shared right basis $B_{\text{shared}}$ from the stacked error either without covariance weighting ($E_{\mathrm{cat}}$, left panels) or with covariance-aware whitening ($E_{\mathrm{cat}}\Sigma_x^{1/2}$, right panels). Each heatmap shows the absolute basis alignment between $\mathrm{row}(B_{\text{shared}})$ and the per-module right subspace for Q, K, V; brighter values denote larger absolute inner products. DiagScore and Affinity summaries are reported in the main text. (A sketch of this alignment computation appears after this list.)
  • ...and 8 more figures
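As a concrete illustration of the energy-capture measurement in Figure 2(c, d), the sketch below computes the cumulative fraction of Frobenius energy recovered by the best rank-$r$ approximation of the stacked error, with and without covariance-aware weighting. This is a minimal reconstruction from the caption, not the paper's evaluation code; the whitening by $\Sigma_x^{1/2}$ follows the caption's "Cov align" variant, and the function name is an assumption.

```python
import numpy as np

def energy_capture(E_cat, Sigma_x=None):
    """Cumulative Frobenius-energy capture of the best rank-r approximation.

    E_cat:   stacked quantization errors for one group, shape (m, d).
    Sigma_x: optional (d, d) input covariance; if given, errors are
             weighted by Sigma_x^{1/2} ("Cov align" in Figure 2).
    Returns an array c where c[r-1] is the energy fraction at rank r.
    """
    if Sigma_x is not None:
        # Symmetric square root of the input covariance via eigendecomposition.
        lam, Q = np.linalg.eigh(Sigma_x)
        sqrt_Sigma = Q @ np.diag(np.sqrt(np.clip(lam, 0, None))) @ Q.T
        E_cat = E_cat @ sqrt_Sigma

    # By Eckart-Young, the best rank-r approximation keeps the top-r
    # singular values, so the captured energy is a partial sum of s_i^2.
    s = np.linalg.svd(E_cat, compute_uv=False)
    return np.cumsum(s**2) / np.sum(s**2)

# Example: smallest rank reaching the 90% dashed line in Figure 2.
# capture = energy_capture(E_cat, Sigma_x)
# r90 = int(np.searchsorted(capture, 0.90)) + 1
```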
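Similarly, the basis-alignment heatmaps of Figure 5 can be reproduced in sketch form by comparing the shared right basis against each module's own right singular subspace; the heatmap entries are the absolute inner products between the two bases. The function name and the simple diagonal/overlap summaries below are assumptions, loosely matching the DiagScore and Affinity quantities mentioned in the caption rather than their exact definitions.

```python
import numpy as np

def basis_alignment(E_cat, E_module, rank):
    """Absolute alignment between the shared and per-module right bases.

    E_cat:    stacked (optionally whitened) group error, shape (m, d).
    E_module: a single module's error (e.g., Q), shape (out, d).
    Returns a (rank, rank) matrix of absolute inner products between
    the two rank-r right singular bases; values near 1 mean alignment.
    """
    B_shared = np.linalg.svd(E_cat, full_matrices=False)[2][:rank]     # (rank, d)
    V_module = np.linalg.svd(E_module, full_matrices=False)[2][:rank]  # (rank, d)
    M = np.abs(B_shared @ V_module.T)

    diag_score = M.diagonal().mean()           # direction-by-direction match
    affinity = np.linalg.norm(M) ** 2 / rank   # subspace overlap in [0, 1]
    return M, diag_score, affinity
```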