Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers

Qian Chen; Wen Wang; Qinglin Zhang; Siqi Zheng; Shiliang Zhang; Chong Deng; Hai Yu; Jiaqing Liu; Yukun Ma; Chong Zhang

TL;DR

Standard Transformer attention operates within a single layer, which makes it hard to relate high-level abstract representations directly to fine-grained details from earlier layers. The paper proposes Skip-Layer Attention (SLA), which lets the queries $Q$ in layer $l$ attend to keys and values $K,V$ from the current layer and from an earlier layer, where $n_l$ is the number of skip layers and $n_h$ is the number of skip heads; computational cost is preserved. Experiments on OpenWebText with GPT-2 variants show consistent improvements, with 9 skip layers and 9 skip heads yielding the strongest gains, especially at longer sequence lengths. The results suggest that connections between non-adjacent layers can improve hierarchical representation learning in Transformers and can inform future architectural refinements for large-scale language models.

Abstract

The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows, refining the Transformer's architecture becomes critical. This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models by enabling direct attention between non-adjacent layers. This method improves the model's ability to capture dependencies between high-level abstract features and low-level details. By facilitating direct attention between these diverse feature levels, our approach overcomes the limitations of current Transformers, which often rely on suboptimal intra-layer attention. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer, thus enhancing the diversity of multi-head attention without additional computational burden. Extensive experiments demonstrate that our enhanced Transformer model achieves superior performance in language modeling tasks, highlighting the effectiveness of our skip-layer attention mechanism.
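The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the earlier layer's already-projected keys and values are reused as-is (which would be one way to avoid extra computation), that the first `n_skip_heads` heads are the skip heads, and that attention is causal as in GPT-2. The function name and argument names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def skip_layer_attention(q, k_cur, v_cur, k_prev, v_prev, n_skip_heads):
    """Causal multi-head attention with some heads reading an earlier layer.

    All tensors have shape (n_heads, seq_len, d_head).
    q, k_cur, v_cur come from the current layer; k_prev, v_prev are the
    keys/values that an earlier layer already computed (assumption: reused).
    """
    n_heads, seq_len, d_head = q.shape

    # Assumption: heads [0, n_skip_heads) are skip heads and use the earlier
    # layer's keys/values; the remaining heads use the current layer's.
    k = k_cur.copy()
    v = v_cur.copy()
    k[:n_skip_heads] = k_prev[:n_skip_heads]
    v[:n_skip_heads] = v_prev[:n_skip_heads]

    # Scaled dot-product scores, shape (n_heads, seq_len, seq_len).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
shape = (4, 5, 8)  # 4 heads, 5 tokens, 8 dims per head (toy sizes)
q, k_cur, v_cur, k_prev, v_prev = (rng.normal(size=shape) for _ in range(5))
out = skip_layer_attention(q, k_cur, v_cur, k_prev, v_prev, n_skip_heads=2)
```

Under these assumptions the total number of heads, and therefore the size of the score matrices, is unchanged; only the source of the keys and values differs between skip heads and ordinary heads.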

Paper Structure (14 sections, 1 equation, 1 figure, 3 tables)

Introduction
Related Work
Method
Experimental Setup
Dataset
Training Setup
Result
Number of skip layers
Number of skip heads
Model Size and Sequence Length Variations
Conclusion
Limitations
Appendix
Code

Figures (1)

Figure 1: Model architecture of the Transformer with skip-layer attention. The left figure illustrates a Transformer model with 12 layers, each equipped with an additional skip-layer attention connection (e.g., layer 1 to layer 10, layer 2 to layer 11, layer 3 to layer 12). The center figure provides a zoomed-in view of each layer, highlighting the skip-layer attention and MLP sublayers. The right figure details the skip-layer attention mechanism, with red indicating keys and values from the preceding layer.
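The layer pairing given as an example in the caption (layer 1 to layer 10, layer 2 to layer 11, layer 3 to layer 12) is consistent with a fixed offset of 9 layers. The short sketch below assumes such a fixed offset and assumes that layers with no sufficiently early source layer have no skip connection; both are readings of the caption, not details confirmed by the text here. The helper name is hypothetical.

```python
def skip_source_layer(layer, n_skip_layers):
    """Return the 1-based index of the earlier layer that `layer` reads
    keys/values from, or None if no such layer exists (assumption)."""
    source = layer - n_skip_layers
    return source if source >= 1 else None

# 12-layer model with an offset of 9, matching the caption's examples.
pairs = {
    layer: skip_source_layer(layer, 9)
    for layer in range(1, 13)
    if skip_source_layer(layer, 9) is not None
}
print(pairs)  # {10: 1, 11: 2, 12: 3}
```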
