DAPE: Data-Adaptive Positional Encoding for Length Extrapolation

Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li

TL;DR

We propose Data-Adaptive Positional Encoding (DAPE), a method that adjusts dynamically and semantically based on the input context and learned fixed priors, improving model performance both at the trained length and in length generalization.

Abstract

Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. Prior research has introduced absolute positional encoding (APE) and relative positional encoding (RPE) to distinguish token positions in given sequences. However, both APE and RPE remain fixed after model training regardless of the input data, which limits their adaptability and flexibility. Hence, we expect that the desired positional encoding should be data-adaptive and dynamically adjustable given the attention. In this paper, we propose a Data-Adaptive Positional Encoding (DAPE) method, which adjusts dynamically and semantically based on the input context and learned fixed priors. Experimental validation on real-world datasets (Arxiv, Books3, and CHE) demonstrates that DAPE improves model performance both at the trained length and in length generalization, and that the improvements are statistically significant. The model visualization suggests that our model can keep both local and anti-local information. Finally, we successfully train the model on sequence length 128 and achieve better performance at evaluation sequence length 8192 than static positional encoding methods, revealing the benefit of the adaptive positional encoding method.

Paper Structure (48 sections, 4 equations, 21 figures, 9 tables)

Figures (21)

  • Figure 1: Visualization of DAPE learned positional biases for the 8192nd query position with key positions between 1 and 8192, while the training length is 512. We notice that DAPE learns both local and anti-local position patterns. The model is trained with Equation \ref{eq:CAPE-attn-mat}: (1) The Attention is ${\bm{X}} {\bm{W}}_Q({\bm{X}} {\bm{W}}_K)^{\top}$; (2) The Kerple bias is ${\bm{B}}$; (3) The DAPE (with Kerple) bias is $f( {\bm{X}} {\bm{W}}_Q({\bm{X}} {\bm{W}}_K)^{\top},{\bm{B}})$. More examples are shown in Appendix \ref{more CAPE visualization}.
  • Figure 2: Comparisons with baselines: performance with training lengths 128 and 512 on Arxiv and Books3 datasets.
  • Figure 3: Results on the training length 1024.
  • Figure 4: The effect of model size: for the 350M model, the performance with training lengths 128 and 512 on the Arxiv dataset.
  • Figure 5: Different variants of DAPE: the DAPE-Kerple performance under different variants. (1) Add_Residual: ${\bm{X}} {\bm{W}}_Q({\bm{X}} {\bm{W}}_K)^{\top} + {\bm{B}}+ f( {\bm{X}} {\bm{W}}_Q({\bm{X}} {\bm{W}}_K)^{\top}+{\bm{B}})$; (2) Concate: ${\bm{X}} {\bm{W}}_Q({\bm{X}} {\bm{W}}_K)^{\top}+ f( {\bm{X}} {\bm{W}}_Q({\bm{X}} {\bm{W}}_K)^{\top},{\bm{B}})$; (3) Concate_Residual: ${\bm{X}} {\bm{W}}_Q({\bm{X}} {\bm{W}}_K)^{\top}+ {\bm{B}} + f( {\bm{X}} {\bm{W}}_Q({\bm{X}} {\bm{W}}_K)^{\top},{\bm{B}})$.
  • ...and 16 more figures
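
The three variants named in the Figure 5 caption can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. The captions do not specify the form of $f$ or of the Kerple bias, so the following are assumptions made only for this sketch: $f$ is a small two-layer ReLU MLP applied to each (query, key) pair across the head dimension, the static bias uses a logarithmic form $-r_1 \log(1 + r_2 |i - j|)$, and the helper names `kerple_log_bias`, `make_mlp`, `attention_logits`, and `dape_logits` are hypothetical. Scaling and causal masking are omitted because the captions do not mention them.

```python
import numpy as np

def kerple_log_bias(seq_len, r1=1.0, r2=1.0):
    """Assumed static relative bias B[i, j] = -r1 * log(1 + r2 * |i - j|)."""
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])
    return -r1 * np.log1p(r2 * dist)

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Assumed form of f: a two-layer ReLU MLP over the last axis."""
    w1 = rng.normal(scale=0.1, size=(in_dim, hidden_dim))
    w2 = rng.normal(scale=0.1, size=(hidden_dim, out_dim))
    def f(z):
        # z has shape (T, T, in_dim); each (query, key) pair is processed
        # independently, so no weight depends on the sequence length T.
        return np.maximum(z @ w1, 0.0) @ w2
    return f

def attention_logits(x, w_q, w_k):
    """X W_Q (X W_K)^T per head. x: (T, d); w_q, w_k: (H, d, d_head)."""
    q = np.einsum("td,hde->hte", x, w_q)
    k = np.einsum("td,hde->hte", x, w_k)
    return np.einsum("hie,hje->hij", q, k)  # shape (H, T, T)

def dape_logits(attn, bias, f, variant):
    """Combine attention logits and a static bias, both of shape (H, T, T)."""
    a = np.moveaxis(attn, 0, -1)  # (T, T, H): head axis last for the MLP
    b = np.moveaxis(bias, 0, -1)
    if variant == "Add_Residual":        # A + B + f(A + B), f takes H inputs
        out = a + b + f(a + b)
    elif variant == "Concate":           # A + f(A, B), f takes 2H inputs
        out = a + f(np.concatenate([a, b], axis=-1))
    elif variant == "Concate_Residual":  # A + B + f(A, B), f takes 2H inputs
        out = a + b + f(np.concatenate([a, b], axis=-1))
    else:
        raise ValueError(f"unknown variant: {variant}")
    return np.moveaxis(out, -1, 0)       # back to (H, T, T)
```

In this sketch the adaptive term depends on the sequence length only through the shapes of its inputs, so the same MLP weights can be applied unchanged to sequences longer than those seen when the weights were fit.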