Improving Transformers using Faithful Positional Encoding

Tsuyoshi Idé, Jokin Labaien, Pin-Yu Chen

TL;DR

The paper addresses the problem that standard Transformer positional encoding (PE), based on a sinusoidal basis with frequencies $w_k$, may fail to faithfully preserve position information due to low-pass characteristics. It introduces a faithfulness notion for PE and derives a Discrete Fourier Transform (DFT) based positional encoding by encoding the one-hot position function $f_s(t)=\delta_{s,t}$ via its DFT coefficients, yielding a faithful and invertible representation. The main contributions include formalizing faithfulness, deriving the DFT PE with a flat Fourier coefficient distribution, proving its reconstructability, and demonstrating consistent improvements in time-series classification on Elevator, SMD, and MSL datasets. The findings suggest that capturing short- to mid-range positional dependencies with a principled, mathematically grounded encoding can enhance Transformer performance in sequential tasks where local position matters.
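The invertibility claim can be illustrated with a minimal sketch. It is not the paper's implementation: the sequence length `T = 100` and the helper names `dft`, `idft`, and `one_hot` are assumptions chosen for illustration, and the paper's actual encoding may arrange the coefficients differently (for instance as real-valued cosine and sine components). The sketch only checks two properties stated above: the DFT coefficients of the one-hot function $f_s(t)=\delta_{s,t}$ have flat magnitude, and the inverse DFT recovers the position $s$.

```python
import cmath

def dft(x):
    """Naive O(T^2) discrete Fourier transform of a sequence."""
    T = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / T) for t in range(T))
            for k in range(T)]

def idft(X):
    """Naive inverse DFT, the exact inverse of dft() above."""
    T = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / T) for k in range(T)) / T
            for t in range(T)]

def one_hot(s, T):
    """Position function f_s(t) = delta_{s,t} as a length-T list."""
    return [1.0 if t == s else 0.0 for t in range(T)]

T = 100                  # assumed sequence length (illustrative only)
positions = (5, 40, 75)  # example positions, chosen arbitrarily below T

recovered = {}
for s in positions:
    coeffs = dft(one_hot(s, T))
    # Flat spectrum: every coefficient has magnitude 1.
    assert all(abs(abs(c) - 1.0) < 1e-9 for c in coeffs)
    # Invertibility: the inverse DFT peaks exactly at the original position.
    rec = [v.real for v in idft(coeffs)]
    recovered[s] = max(range(T), key=lambda t: rec[t])

print(recovered)  # {5: 5, 40: 40, 75: 75}
```

Because every frequency contributes with equal weight, no part of the spectrum is suppressed, which is the intuition behind contrasting this encoding with the low-pass behavior of the sinusoidal PE.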

Abstract

We propose a new positional encoding method for a neural network architecture called the Transformer. Unlike the standard sinusoidal positional encoding, our approach is based on solid mathematical grounds and has a guarantee of not losing information about the positional order of the input sequence. We show that the new encoding approach systematically improves the prediction performance in the time-series classification task.

Paper Structure

This paper contains 13 sections, 1 theorem, 23 equations, 2 figures, 1 table, and 1 algorithm.

Key Result

Theorem 1

The DFT encoding is faithful.

Figures (2)

  • Figure 1: The distribution of the frequency $w_k = \rho^{-\frac{k}{d}}$ of the standard sinusoidal PE over the Fourier bases (black line), where $\rho=10000$, $d=256$. Note the contrast with the DFT encoding, which gives a uniform distribution (red line; see the section on DFT encoding).
  • Figure 2: Reconstruction by the paper's reference reconstruction algorithm for the location functions $f(t)=\delta_{t,5},\ \delta_{t,40},\ \delta_{t,75}$. Perfect reconstruction corresponds to single-peaked spikes at $t=5, 40, 75$, respectively. The broad distributions produced by the original PE (top) demonstrate a significant loss of information. See the section on DFT encoding for the DFT result (bottom).
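The skew described in the Figure 1 caption can be checked numerically with a rough sketch. Several choices here are assumptions rather than details from the paper: the index range $k = 0, \dots, d-1$, the 10-bin histogram over $[0, 1]$, and the evenly spaced grid `uniform`, which merely stands in for a flat frequency distribution and is not the paper's DFT frequency set.

```python
RHO = 10000.0  # rho from the Figure 1 caption
D = 256        # d from the Figure 1 caption
N_BINS = 10    # assumed histogram resolution (illustrative only)

def histogram(values, n_bins=N_BINS):
    """Count values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        # Clamp v == 1.0 into the last bin.
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts

# Sinusoidal PE frequencies w_k = rho^(-k/d); the range of k is an assumption.
sinusoidal = [RHO ** (-k / D) for k in range(D)]

# Evenly spaced stand-in for a flat (uniform) frequency distribution.
uniform = [(k + 1) / D for k in range(D)]

sin_hist = histogram(sinusoidal)
uni_hist = histogram(uniform)

print("sinusoidal:", sin_hist)
print("uniform:   ", uni_hist)
print("share of sinusoidal frequencies in the lowest bin:", sin_hist[0] / D)
```

Since $w_k < 0.1$ whenever $k/d > \log 10 / \log 10000 = 0.25$, roughly three quarters of the sinusoidal frequencies fall into the lowest tenth of the range, whereas the evenly spaced grid fills every bin almost equally.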

Theorems & Definitions (2)

  • Definition 1: Faithfulness of PE
  • Theorem 1