MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation

Weiguo Gao

TL;DR

MEP is a novel relative positional encoding method that combines distinct kernel functions through a weighted average to produce a bias, which is then applied as a penalty to the post-softmax attention scores.

Abstract

When the sequence length at inference exceeds the length seen during training, a transformer's accuracy degrades. Existing relative positional encoding methods, such as those based on ALiBi, address this length extrapolation problem with a single kernel function, which adds a constant bias to each post-softmax attention score according to its distance. These approaches do not investigate or employ multiple kernel functions. Building on ALiBi, this study proposes a novel relative positional encoding method, called MEP, which uses a weighted average to combine distinct kernel functions (such as the exponential kernel and the Gaussian kernel) into a bias that is applied to the post-softmax attention scores. First, the framework combines several kernel functions into a multiple-kernel function. Each kernel receives the same mean weight coefficient, so that the complementary strengths of the different kernels form a new bias function. Next, specific slopes are set for each kernel function, so that penalties are applied at different rates, to improve the model's extrapolation capability. Finally, this bias is incorporated as a penalty on the post-softmax scores. We present two versions of the method: a parameter-free variant that requires no new learnable parameters and improves length extrapolation without compromising training efficiency, and a parameterized variant that can integrate state-of-the-art techniques. Empirical evaluations across diverse datasets show that both variants achieve state-of-the-art performance, outperforming traditional parameter-free and parameterized approaches, respectively.
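The weighted-average idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the kernel forms `exp(-slope * d)` and `exp(-slope * d**2)`, the equal weights `alpha = beta = 0.5`, the function names, and the row renormalization after applying the penalty are all assumptions made for illustration.

```python
import numpy as np

def exponential_kernel(dist, slope):
    # Assumed form: exp(-slope * |i - j|), the multiplicative counterpart
    # of an ALiBi-style linear bias added before the softmax.
    return np.exp(-slope * dist)

def gaussian_kernel(dist, slope):
    # Assumed form: exp(-slope * |i - j|^2), which decays faster at long range.
    return np.exp(-slope * dist ** 2)

def mixed_kernel(dist, slope, alpha=0.5, beta=0.5):
    # Weighted average of the two kernels (weights are hypothetical defaults).
    return alpha * exponential_kernel(dist, slope) + beta * gaussian_kernel(dist, slope)

def apply_kernel_penalty(scores, slope, alpha=0.5, beta=0.5):
    """Scale causal post-softmax scores of shape (T, T) by the mixed kernel.

    Renormalizing each row so it sums to 1 again is an assumption of this sketch.
    """
    T = scores.shape[0]
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :]).astype(float)
    penalized = scores * mixed_kernel(dist, slope, alpha, beta)
    return penalized / penalized.sum(axis=-1, keepdims=True)

# Toy example: uniform causal attention over 6 positions.
T = 6
mask = np.tril(np.ones((T, T), dtype=bool))
logits = np.where(mask, 0.0, -np.inf)
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = apply_kernel_penalty(weights, slope=1.0)
```

In this toy setting every row of `out` still sums to 1, but mass shifts toward nearby positions, which is the qualitative effect a distance penalty is meant to have.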

Paper Structure (9 sections, 28 equations, 3 figures, 6 tables)

This paper contains 9 sections, 28 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: (a) The earlier positional encoding ALiBi applies a single exponential kernel function to the post-softmax attention scores. For a transformer language model with $H$ attention heads, the range of $h$ is $n\cdot\frac{8}{H}$, where $n\in\{1,\dots,H\}$. Left: the post-softmax self-attention matrix; right: the temporal bias matrix. (b) In contrast, the proposed MEP positional encoding builds the bias from multiple kernel functions and applies it to each post-softmax attention score according to its distance. We employ multiple kernel learning, merging exponential, Gaussian, and polynomial kernels. In the exponential and Gaussian kernels, the range of $h$ matches that of ALiBi, while in the polynomial kernel $h$ represents learned parameters. Left: the post-softmax self-attention matrix; middle: the exponential-kernel temporal bias matrix; right: the Gaussian-kernel temporal bias matrix. In this example, the coefficients are $\alpha=0.5$ and $\beta=0.5$, exp denotes the exponential kernel, and the slope value is 1.
  • Figure 2: Each point denotes the post-softmax attention score at relative position $|i-j|$ after the kernel function has been applied. (a) Exponential kernel scores, head = 2 to 8. (b) Gaussian kernel scores. (c) Scores of our multiple-kernel (MKL) function in the MEP parameter-free model.
  • Figure 3: Exponential, Gaussian, and MEP function curves. The x-axis represents the relative position, $i-j$, from 0 to 511; the y-axis represents the value after the kernel function is applied.
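
To make the curves described in these captions concrete, the following sketch evaluates an exponential kernel, a Gaussian kernel, and their equal-weight mixture at relative positions 0 to 511. The per-head slope $2^{-8n/H}$ follows the ALiBi convention; the exact kernel forms, the weights, and the helper names `alibi_slopes` and `kernel_curves` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def alibi_slopes(num_heads):
    # ALiBi-style geometric slopes: head n (1-based) gets 2 ** (-8 * n / num_heads).
    n = np.arange(1, num_heads + 1)
    return 2.0 ** (-8.0 * n / num_heads)

def kernel_curves(slope, max_dist=511, alpha=0.5, beta=0.5):
    # Evaluate the assumed kernel forms at relative positions 0..max_dist.
    d = np.arange(max_dist + 1, dtype=float)
    exp_curve = np.exp(-slope * d)          # exponential kernel (assumed form)
    gauss_curve = np.exp(-slope * d ** 2)   # Gaussian kernel (assumed form)
    mixed_curve = alpha * exp_curve + beta * gauss_curve  # weighted average
    return d, exp_curve, gauss_curve, mixed_curve

slopes = alibi_slopes(8)
d, exp_curve, gauss_curve, mixed_curve = kernel_curves(slopes[0])

# Print a few sample points to compare how quickly each curve decays.
for pos in (0, 1, 2, 4, 8):
    print(pos, exp_curve[pos], gauss_curve[pos], mixed_curve[pos])
```

Under these assumptions all three curves start at 1 for distance 0, and the mixture always lies between the Gaussian curve (fastest decay) and the exponential curve (slowest decay).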