Surgical Repair of Collapsed Attention Heads in ALiBi Transformers

Palmer Schallon

TL;DR

This paper introduces surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applying it reveals two distinct post-surgical phenomena: early global functional redistribution that improves the model, and late local degradation that accumulates under a noisy training signal.

Abstract

We identify a systematic attention collapse pathology in the BLOOM family of transformer language models, in which ALiBi positional encoding causes 31-44% of attention heads to attend almost entirely to the beginning-of-sequence (BOS) token. The collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in head indices where ALiBi's slope schedule imposes the steepest distance penalties. We introduce surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applied to BLOOM-1b7 on a single consumer GPU, the technique raises the number of operational heads from 242 to 379 of 384 (98.7%) in two passes. A controlled comparison with C4 training data confirms that reinitialization, not corpus content, drives recovery, and reveals two distinct post-surgical phenomena: early global functional redistribution that improves the model, and late local degradation that accumulates under a noisy training signal. An extended experiment that reinitializes mostly-healthy heads alongside collapsed ones produces a model that transiently outperforms stock BLOOM-1b7 by 25% on training perplexity (12.70 vs. 16.99), suggesting that pretrained attention configurations are suboptimal local minima. Code, checkpoints, and diagnostic tools are released as open-source software.
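
As a rough illustration of the three ingredients named above (reinitialized Q/K/V, zeroed output projection, gradient masking), the following NumPy sketch operates on a single toy attention layer. It is not the released implementation: the separate `w_q`/`w_k`/`w_v`/`w_o` matrices, the `y = W x` layout, the 0.02 initialization scale, the plain SGD update, and the choice to leave the zeroed output-projection slice trainable are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np


def surgical_reinit(w_q, w_k, w_v, w_o, heads, head_dim, std=0.02, seed=0):
    """Reinitialize Q/K/V rows and zero W_o columns for `heads`, in place.

    Assumed layout: y = W x, so a head owns a block of rows in W_q/W_k/W_v
    and the matching block of columns in W_o.
    Returns boolean masks that are True where a parameter stays trainable.
    """
    rng = np.random.default_rng(seed)
    qkv_mask = np.zeros_like(w_q, dtype=bool)
    o_mask = np.zeros_like(w_o, dtype=bool)
    for h in heads:
        sl = slice(h * head_dim, (h + 1) * head_dim)
        for w in (w_q, w_k, w_v):
            # Fresh random weights for the surgical head only.
            w[sl, :] = rng.normal(0.0, std, size=w[sl, :].shape)
        # Zeroed output projection: the head starts out contributing nothing.
        w_o[:, sl] = 0.0
        qkv_mask[sl, :] = True
        o_mask[:, sl] = True
    return qkv_mask, o_mask


def masked_sgd_step(param, grad, mask, lr=0.1):
    """SGD update in which gradients outside `mask` are zeroed (frozen)."""
    param -= lr * grad * mask


# Toy layer: 4 heads of width 2 (hypothetical sizes), surgery on head 3.
hidden, n_heads = 8, 4
head_dim = hidden // n_heads
init = np.random.default_rng(1)
w_q, w_k, w_v, w_o = (init.normal(size=(hidden, hidden)) for _ in range(4))

qkv_mask, o_mask = surgical_reinit(w_q, w_k, w_v, w_o, heads=[3], head_dim=head_dim)

# A dummy gradient moves only the surgical rows of w_q; the rest stay frozen.
masked_sgd_step(w_q, np.ones_like(w_q), qkv_mask)
```

Because the output-projection slice is zero immediately after surgery, the reinitialized head adds nothing to the layer output at the first step in this sketch, so the output initially depends only on the untouched heads.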

Paper Structure (28 sections, 2 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Bimodal distribution of BOS mass across all 384 heads in stock BLOOM-1b7. Heads cluster near 0.0 (healthy, content-dependent attention) or above 0.8 (collapsed, BOS-fixated), with very few in the intermediate range. Our 0.50 threshold falls in the sparse valley, making classification robust.
  • Figure 2: Cross-scale BOS-sink band pattern across the BLOOM family (560M, 1.7B, 3B, 7.1B). Each panel shows a layer $\times$ head heatmap colored by BOS mass. The sick band (upper head indices, darker color) appears consistently across all scales.
  • Figure 3: Three-way attention topology comparison. Stock BLOOM-1b7 (left), curated surgery E3 (center), C4 baseline E3 (right). Each panel shows the $24 \times 16$ head grid colored by BOS mass. Curated surgery drives more global redistribution outside the surgical zone while achieving lower perplexity.
  • Figure 4: Two-phenomenon comparison: curated E3 vs. C4 E3 redistribution patterns at matched epochs. Phenomenon 1 (functional redistribution) is early, global, and beneficial: the curated corpus drives more outside-zone redistribution. Phenomenon 2 (local degradation) is late, local, and pathological: C4 at E15 shows doubled in-band frozen drift relative to E3.
  • Figure 5: H5 sub-epoch PPL trajectory. Perplexity drops monotonically from post-reinitialization (19.25) to 12.70 at step 42 (30% of epoch 1), crossing the stock BLOOM-1b7 baseline (16.99) by step 10. Overfitting begins after step 42, causing perplexity to rise above the stock baseline after approximately one full epoch.
  • ...and 2 more figures
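
The head classification described in the Figure 1 caption can be sketched as follows. Only the 0.50 threshold comes from the caption; the definition of BOS mass used here (attention weight on position 0, averaged over query positions and skipping query 0, which can only attend to itself under a causal mask), the strict `>` comparison, and the function names are assumptions for illustration. The toy matrices are made-up numbers, not measurements.

```python
from typing import Sequence

BOS_THRESHOLD = 0.50  # threshold quoted in the Figure 1 caption


def bos_mass(attn: Sequence[Sequence[float]]) -> float:
    """Mean attention weight placed on key position 0 (the BOS token).

    Assumption: `attn` is one head's causal attention matrix with rows as
    queries and columns as keys, each row summing to 1. Query 0 is skipped.
    """
    rows = attn[1:]
    if not rows:
        raise ValueError("need at least two query positions")
    return sum(row[0] for row in rows) / len(rows)


def is_collapsed(attn: Sequence[Sequence[float]],
                 threshold: float = BOS_THRESHOLD) -> bool:
    """Label a head as collapsed when its BOS mass exceeds the threshold."""
    return bos_mass(attn) > threshold


# Toy 4-token attention maps (hypothetical values).
bos_fixated = [
    [1.00, 0.00, 0.00, 0.00],
    [0.95, 0.05, 0.00, 0.00],
    [0.90, 0.05, 0.05, 0.00],
    [0.85, 0.05, 0.05, 0.05],
]
content_dependent = [
    [1.00, 0.00, 0.00, 0.00],
    [0.10, 0.90, 0.00, 0.00],
    [0.05, 0.25, 0.70, 0.00],
    [0.05, 0.15, 0.20, 0.60],
]

for name, attn in [("bos_fixated", bos_fixated),
                   ("content_dependent", content_dependent)]:
    print(name, round(bos_mass(attn), 3), is_collapsed(attn))
```

With a bimodal distribution like the one the caption describes, the exact threshold matters little: heads near 0.0 or above 0.8 receive the same label for any cut placed in the sparse valley between the two modes.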