Greedy-Gnorm: A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning

Yuxi Guo, Paul Sheridan

TL;DR

Greedy-Gnorm addresses the staleness of static head-importance scores in transformer pruning by recomputing a gradient-based head score after each pruning step. The score is the elementwise product of the $\ell_2$ norms of the Q, K, and V gradient blocks, $S(n) = G_{Q(n)} \odot G_{K(n)} \odot G_{V(n)}$, so it adapts as gradients redistribute across the remaining heads. To stabilize comparisons and prevent numerical issues, the method uses an $\varepsilon$-rectified entropy variant (form $C$) for the attention entropy (AE) baselines. Across BERT, ALBERT, RoBERTa, and XLM-RoBERTa, Greedy-Gnorm preserves high task accuracy under substantial head removal and outperforms AE and random pruning along the pruning trajectory. The work provides a practical, gradient-informed, iterative pruning framework that could be combined with other compression techniques to enable greener, more deployable transformer models.

Abstract

Attention head pruning has emerged as an effective technique for transformer model compression, an increasingly important goal in the era of Green AI. However, existing pruning methods often rely on static importance scores, which fail to capture the evolving role of attention heads during iterative removal. We propose Greedy-Gradient norm (Greedy-Gnorm), a novel head pruning algorithm that dynamically recalculates head importance after each pruning step. Specifically, each head is scored by the elementwise product of the $\ell_2$ norms of its Q/K/V gradient blocks, as estimated from a hold-out validation set and updated at every greedy iteration. This dynamic approach to scoring mitigates stale rankings and better reflects gradient-informed importance as pruning progresses. Extensive experiments on BERT, ALBERT, RoBERTa, and XLM-RoBERTa demonstrate that Greedy-Gnorm consistently preserves accuracy under substantial head removal, outperforming attention entropy. By effectively reducing model size while maintaining task performance, Greedy-Gnorm offers a promising step toward more energy-efficient transformer model deployment.
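
The scoring and greedy loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each head owns a contiguous column block of the Q, K, and V weight gradients, that the lowest-scoring head is the one pruned at each step, and that a hypothetical callback `compute_grads(mask)` returns validation-set gradients for the currently masked model. None of these details are specified in the abstract.

```python
import numpy as np

def head_block_norms(grad, num_heads):
    """l2 (Frobenius) norm of each head's column block in a 2-D gradient matrix.

    Assumption: head h owns columns [h * head_dim, (h + 1) * head_dim).
    """
    rows, cols = grad.shape
    head_dim = cols // num_heads
    blocks = grad.reshape(rows, num_heads, head_dim)
    return np.sqrt((blocks ** 2).sum(axis=(0, 2)))

def gnorm_scores(grad_q, grad_k, grad_v, num_heads):
    """Per-head score: elementwise product of the Q, K, and V block norms."""
    return (head_block_norms(grad_q, num_heads)
            * head_block_norms(grad_k, num_heads)
            * head_block_norms(grad_v, num_heads))

def greedy_prune(compute_grads, num_layers, num_heads, num_to_prune):
    """Prune one head per iteration, recomputing all scores after every prune.

    compute_grads is a hypothetical callback: it takes the boolean keep-mask
    and returns one (grad_q, grad_k, grad_v) tuple per layer.
    """
    mask = np.ones((num_layers, num_heads), dtype=bool)
    order = []
    for _ in range(num_to_prune):
        grads = compute_grads(mask)
        scores = np.stack([gnorm_scores(gq, gk, gv, num_heads)
                           for gq, gk, gv in grads])
        # Already-pruned heads must never be selected again.
        scores = np.where(mask, scores, np.inf)
        # Assumption: the head with the smallest score is removed.
        layer, head = np.unravel_index(np.argmin(scores), scores.shape)
        mask[layer, head] = False
        order.append((int(layer), int(head)))
    return mask, order
```

The key design point is that `compute_grads` is called inside the loop, so every ranking reflects the gradients of the model as it stands after the previous prune rather than a single ranking computed up front.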

Paper Structure (32 sections, 27 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: ALBERT tied-head structure (appearing as a vertical band in the mask visualization). Because head $h$ shares its parameters across all $L$ layers, gating a single mask entry disables the same head index in every layer, so one mask position affects 12 heads across layers.
  • Figure 2: Gradient changes after pruning (downsampled by $64\!\times\!64$ pooling from $768\!\times\!768$ to $12\!\times\!12$). Nonzero values across settings indicate nonlocal gradient shifts.
  • Figure 3: Greedy-Gnorm vs. AE (and inverse variants). Greedy-Gnorm is more stable and preserves accuracy under deeper pruning.
  • Figure 4: Greedy-Gnorm versus random pruning across pruning rates. Boxplots summarize multiple random masks per rate, while the green curves show Greedy-Gnorm task accuracy. The dashed yellow line shows accuracy with no pruning.
  • Figure 5: Final pruning masks across models. BERT shows dispersed retention; ALBERT shows vertical bands due to parameter sharing; RoBERTa/XLM-RoBERTa display selective, layer-dependent retention.
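
The downsampling mentioned in the Figure 2 caption, from $768\!\times\!768$ to $12\!\times\!12$ with $64\!\times\!64$ pooling, can be illustrated as follows. This is a sketch only: the caption does not say which pooling operator is used, so mean pooling is assumed here, and `block_mean_pool` is a hypothetical helper name.

```python
import numpy as np

def block_mean_pool(matrix, block):
    """Average non-overlapping block x block tiles of a 2-D array.

    Assumption: mean pooling; the caption does not name the operator.
    """
    rows, cols = matrix.shape
    if rows % block or cols % block:
        raise ValueError("matrix dimensions must be divisible by block size")
    tiles = matrix.reshape(rows // block, block, cols // block, block)
    return tiles.mean(axis=(1, 3))

# A 768 x 768 gradient-difference matrix pooled with 64 x 64 tiles
# yields a 12 x 12 summary, matching the sizes in the caption.
grad_delta = np.random.default_rng(0).normal(size=(768, 768))
pooled = block_mean_pool(grad_delta, 64)
```

For a 768-dimensional model with 12 heads, 64 is also the per-head dimension, so each pooled cell may correspond to one head-sized block; the caption itself does not state this.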