Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation

Wang Zixian

TL;DR

This work addresses why pre-trained transformers often struggle to adapt to target domains, attributing it to gradient suppression caused by output saturation in lower layers, which biases adaptation toward high-level feature recombination. It introduces a diagnostic framework built on four layer-wise observables (attention entropy, activation-gradient norm, parameter-gradient norm, and ΔCKA under a shared PCA basis) to identify inflection layers where gradient flow is most suppressed. A diagnose-first strategy combines automatic inflection-layer localization (via SKI) with selective LoRA injection in the identified band, restoring backward signals with minimal parameter overhead. Empirical evaluation on a BERT-base transfer task (SST-2 to Rotten Tomatoes) with under-trained (UNDER) and over-trained (OVER) source initializations shows that selective LoRA at inflection layers yields the best accuracy (≈91.59%) with ~0.3M parameters, outperforming uniform LoRA as well as full and shallow unfreezing. The results support a dichotomy between high-level composition and low-level reconstruction, and demonstrate a practical, reproducible pipeline for efficient transfer across domains. The approach offers a general-purpose, diagnosis-driven pathway for targeted parameter-efficient fine-tuning and motivates extensions to other modalities and architectures.
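
As a rough illustration of the first observable listed above, the sketch below computes the mean Shannon entropy of attention rows with NumPy. It is not the paper's implementation: the function name `attention_entropy`, the `(heads, query_len, key_len)` layout, the averaging over heads and queries, and the random test scores are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over all attention rows.

    attn: array of shape (heads, query_len, key_len), where each row along
    the last axis is a probability distribution over keys (assumed layout).
    """
    attn = np.asarray(attn, dtype=np.float64)
    # Clip before the log so zero-probability entries contribute exactly 0.
    log_p = np.log(np.clip(attn, eps, 1.0))
    row_entropy = -np.sum(attn * log_p, axis=-1)
    return float(row_entropy.mean())


# Hypothetical attention scores: 2 heads, 4 queries, 8 keys.
rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 4, 8))

diffuse = attention_entropy(softmax(scores))         # moderate logits
peaked = attention_entropy(softmax(scores * 20.0))   # sharpened, near one-hot rows
uniform = attention_entropy(np.full((1, 1, 8), 1.0 / 8))  # maximum entropy, log(8)
```

Under these assumptions, sharpening the logits drives the entropy toward zero, and a uniform row over 8 keys gives the maximum value of log 8. Computing this value per layer gives one number per layer that can be compared across depth, which is the sense in which low entropy serves as a saturation proxy.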

Abstract

Pre-trained Transformers often exhibit over-confidence in source patterns and difficulty in forming new target-domain patterns during fine-tuning. We formalize the mechanism of output saturation leading to gradient suppression through standard cross-entropy and softmax analysis, showing that gradient suppression at inflection layers confines adaptation to high-level recombination of existing features while preventing low-level reconstruction. We introduce a set of layer-wise diagnostic metrics (attention entropy as a saturation proxy, activation gradient norm, parameter gradient norm, and Delta-CKA under a shared PCA basis) to identify inflection layers characterized by both low attention entropy and steep gradient decay. Building on these findings, we propose a diagnose-first, inject-light fine-tuning strategy: selectively inserting LoRA adapters at inflection layers to restore suppressed backward signals with minimal parameter overhead. Experiments on BERT-base transfer from SST-2 to Rotten Tomatoes, starting from under-trained and over-trained source models, reveal that over-trained initialization benefits from inflection-layer LoRA injection, while under-trained initialization suffers performance degradation. When base features are strong, unblocking inflection layers facilitates high-level compositional adaptation; when base features are weak, full-pathway unblocking is required for low-level reconstruction, as supported by joint analysis of layer-wise activation gradients and Delta-CKA dynamics.
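
The paper's own equations are not reproduced on this page, so as a reminder of the standard identities that a cross-entropy and softmax analysis of this kind typically rests on, here is the textbook form. The notation is assumed for this sketch: $z$ are logits, $p$ the softmax output, $y$ a one-hot label, and $c$ the labeled class.

```latex
% Assumed notation: z = logits, p = softmax(z), y = one-hot label, c = labeled class.
\begin{align}
p_k &= \frac{\exp(z_k)}{\sum_j \exp(z_j)}, \qquad
\mathcal{L} = -\sum_k y_k \log p_k, \\
\frac{\partial \mathcal{L}}{\partial z_k} &= p_k - y_k, \\
\frac{\partial p_i}{\partial z_j} &= p_i \,(\delta_{ij} - p_j).
\end{align}
```

Two consequences follow directly. When the output is confident and correct ($p_c \to 1$), every component of $p - y$ tends to zero, so little error signal enters the backward pass. And whenever any softmax output approaches a one-hot vector, which includes a low-entropy attention row, every entry $p_i(\delta_{ij} - p_j)$ of its Jacobian tends to zero, so gradients passing through that softmax are attenuated regardless of the upstream signal.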

Paper Structure

This paper contains 19 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Multi-seed results: Selective LoRA achieves the highest accuracy with minimal parameters, outperforming uniform LoRA (Everywhere) and traditional unfreezing strategies.
  • Figure 2: Shallow unfreezing (top-2 layers): UNDER vs. OVER diagnostics. OVER shows clear inflection layers around layers 5-7, with low entropy and a steep gradient cliff.
  • Figure 3: Full unfreezing: layer-wise diagnostics show persistent inflection patterns even with all layers trainable.
  • Figure 4: Selective LoRA injection: OVER benefits from inflection-layer LoRA while UNDER shows degradation, demonstrating that gradient suppression confines models to high-level composition; enabling low-level reconstruction requires full pathway unblocking.
  • Figure 5: Layer-wise probe accuracy: task separability concentrates in upper layers across all settings.