Positional Encoding Helps Recurrent Neural Networks Handle a Large Vocabulary

Takashi Morita

TL;DR

Investigations through synthetic benchmarks reveal an advantage of coupling positional encoding and RNNs, especially for handling a large vocabulary that yields low-frequency tokens, and shed new light on the utility of positional encoding beyond its canonical role as a timekeeper for Transformers.

Abstract

This study reports the counterintuitive finding that positional encoding enhances learning in recurrent neural networks (RNNs). Positional encoding is a high-dimensional representation of the time indices of input data. Most famously, positional encoding complements the capabilities of Transformer neural networks, which lack an inherent mechanism for representing data order. By contrast, RNNs can encode the temporal information of data points on their own, rendering their use of positional encoding seemingly redundant. Nonetheless, investigations through synthetic benchmarks reveal an advantage of coupling positional encoding and RNNs, especially for handling a large vocabulary that yields low-frequency tokens. Further analysis reveals that these low-frequency tokens destabilize the gradients of vanilla RNNs, and that positional encoding resolves this instability. These results shed new light on the utility of positional encoding beyond its canonical role as a timekeeper for Transformers.
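
To make "a high-dimensional representation of time indices" concrete, here is a minimal sketch of the sinusoidal encoding popularized by Transformers. The abstract does not state which encoding variant is used or how it is combined with the token embeddings, so the function names, the `base` constant, the encoding width, and the concatenation step are all assumptions made for illustration only.

```python
import math


def sinusoidal_positional_encoding(length, dim, base=10000.0):
    """Return a length x dim table whose row t encodes time index t.

    Hypothetical helper: sine/cosine pairs at geometrically spaced
    frequencies, as in the original Transformer formulation.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for t in range(length):
        row = []
        for k in range(dim // 2):
            freq = base ** (-2.0 * k / dim)  # lower frequency for larger k
            row.append(math.sin(t * freq))
            row.append(math.cos(t * freq))
        table.append(row)
    return table


def concat_position(embeddings, encoding):
    """Concatenate each token embedding with the encoding of its time index.

    Assumption: concatenation is only one way to couple the two;
    element-wise addition is another common choice.
    """
    return [list(emb) + list(enc) for emb, enc in zip(embeddings, encoding)]


# Example: 64 time steps (the input length quoted in the figure captions)
# and an arbitrary, assumed encoding width of 8.
encoding = sinusoidal_positional_encoding(length=64, dim=8)
dummy_embeddings = [[0.0, 0.0, 0.0, 0.0] for _ in range(64)]
rnn_inputs = concat_position(dummy_embeddings, encoding)
print(len(rnn_inputs), len(rnn_inputs[0]))
```

Each row of the table is distinct, so a model that receives `rnn_inputs` sees an explicit time stamp at every step in addition to the token itself.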

Paper Structure

This paper contains 24 sections, 3 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Illustration of the model structure and the reverse-ordering task.
  • Figure 2: Token-wise accuracy (left) and sequence-wise reconstruction errors (right) of the reverse-ordering task performed by GRU/LSTM/S4D with and without positional encoding (labeled as "Position-Encoded" and "Vanilla" respectively). The input length was fixed at $64$. The error bars represent the 95% confidence interval estimated from 10,000 bootstrapped samples of five training-test trials with different random seeds. Each of the five trials held out 1024 random sequences (= 65,536 tokens) for computing the test accuracy.
  • Figure 3: Token-wise accuracy of the reverse-ordering task performed by GRU/LSTM/S4D with and without positional encoding (labeled as "Position-Encoded" and "Vanilla" respectively). The vocabulary was evenly split into Frequent and Rare groups (32+32 for GRU, 512+512 for LSTM, and 1024+1024 for S4D), and the former was sampled three times more frequently than the latter. The input length was fixed at $64$. The error bars represent the 95% confidence interval estimated from 10,000 bootstrapped samples of five training-test trials with different random seeds. Each of the five trials held out 4096 test sequences (= 262,144 tokens), consisting of a single "target" token (frequent or rare) surrounded by 63 "disturbants" (all frequent or all rare). That is, sixteen test sequences were held out for each condition (frequent/rare target $\times$ frequent/rare disturbants $\times$ target positions).
  • Figure 4: Schematic illustration of the analysis of gradient stability. The RNN trained on the reverse-ordering task processed a pair of input sequences that shared the initial token ($t=1$; blue) but differed in the rest ($2\leq t\leq L$; referred to as "Disturbant A/B", and colored in orange/purple). For each dimension $i$ of the final RNN output $h_{2L,i}$ at time $2L$, the distant gradient $\left( \frac{\partial h_{2L,i}}{\partial z_{1,1}}, \dots, \frac{\partial h_{2L,i}}{\partial z_{1,(2)D}} \right)^T$ at the first updated latent state $\vec{\mathbf{z}}_1$ ($=\vec{\mathbf{h}}_1$ in GRU; $=$ concatenation of the hidden and cell states in LSTM, doubling the total dimensionality to $2D$) was computed per input sequence via backpropagation through time (dashed lines). The gradient stability was defined by the dot-product similarity of the paired gradients normalized over the output dimensions by the coefficients $\alpha_i^{(s)}$ ($s \in \{A,B\}$), whose definition is provided in Eq. \ref{eq:normalizer}.
  • Figure 5: Gradient stability of GRU/LSTM/S4D trained on the reverse-ordering task with and without positional encoding (labeled as "Position-Encoded" and "Vanilla" respectively). For the GRU and LSTM, the stability was defined by the dot-product similarity of latent-to-latent gradients after normalization over the output dimensions, conditioned on two input sequences sharing the initial "target" token (whose Frequent vs. Rare distinction is represented by the line color), followed by Frequent or Rare disturbants (represented by the solid vs. dashed lines). For the S4D, the target token was positioned at $t=23$, where the vanilla model had the lowest accuracy with the Rare disturbants. The disturbants were prefixed and suffixed to the target to construct input sequences. The prefix disturbants were shared between the paired sequences, ensuring that the latent dynamics of the model remained identical up to the target token. The total input length was $1+63 = 22+1+41 = 64$. The average similarity over 1024 input pairs times five trials is plotted for every 5000 training iterations.
  • ...and 10 more figures
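
The error bars in Figures 2 and 3 are described as 95% confidence intervals estimated from 10,000 bootstrapped samples of five trials. The captions do not say which bootstrap variant was used, so the following is only a sketch under the assumption of a plain percentile bootstrap of the mean. The function name `bootstrap_ci`, the fixed seed, and the accuracy values are hypothetical placeholders, not results from the paper.

```python
import random


def bootstrap_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Assumption: resample the trials with replacement, take the mean of
    each resample, and read off the empirical quantiles.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = int((1.0 - level) / 2.0 * n_resamples)
    upper_index = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical per-trial test accuracies for five random seeds.
trial_accuracies = [0.91, 0.93, 0.90, 0.95, 0.92]
lower, upper = bootstrap_ci(trial_accuracies)
print(f"95% CI for mean accuracy: [{lower:.3f}, {upper:.3f}]")
```

With only five trials, the resampled means can take a limited number of distinct values, so intervals computed this way are coarse and best read as a rough indication of seed-to-seed variability.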