Location Attention for Extrapolation to Longer Sequences

Yann Dubois, Gautier Dagan, Dieuwke Hupkes, Elia Bruni

TL;DR

  • Problem: standard neural seq2seq models struggle to extrapolate to sequences longer than those seen during training.
  • Approach: introduce a Location Attender that models position-based glimpses with a Gaussian over relative positions, plus a Mix Attender that convexly combines content and location attention, evaluated on controlled Lookup Table tasks designed to stress extrapolation.
  • Findings: location-based attention improves long-sequence extrapolation over strong baselines; Mix Attender further aligns attention with target patterns and helps in harder variants, though EOS-related failures prevent perfect extrapolation.
  • Significance: demonstrates that explicit location-based biases can enable extrapolation in sequence processing and points to future work toward removing brittle heuristics and applying these ideas to self-attention architectures.

Abstract

Neural networks are surprisingly good at interpolating and perform remarkably well when the training set examples resemble those in the test set. However, they are often unable to extrapolate patterns beyond the seen data, even when the abstractions required for such patterns are simple. In this paper, we first review the notion of extrapolation, why it is important and how one could hope to tackle it. We then focus on a specific type of extrapolation which is especially useful for natural language processing: generalization to sequences that are longer than the training ones. We hypothesize that models with separate content- and location-based attention are more likely to extrapolate than those with common attention mechanisms. We empirically support our claim for recurrent seq2seq models with our proposed attention on variants of the Lookup Table task. This sheds light on some striking failures of neural models for sequences and on possible methods for approaching such issues.

Paper Structure

This paper contains 23 sections, 7 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Schematic extrapolation setting for $d=2$.
  • Figure 2: Attender in a recurrent seq2seq.
  • Figure 3: Proposed Location Attender. Given a resized query, the Weighter outputs the standard deviation $\sigma_t$ and $\pmb{\rho}_t$ which will weight the building blocks $\mathbf{b}_t$ to compute the mean $\mu_t$. $\mu_t$ and $\sigma_t$ parametrize a Gaussian PDF used to compute the location attention $\pmb{\lambda}_t$.
  • Figure 4: Soft staircase activation function.
  • Figure 5: Mix Attender. The output $\pmb \alpha_t$ is a convex combination of the content and location attention.
  • ...and 4 more figures
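
The captions of Figures 3 and 5 describe two computations: a Gaussian PDF parametrized by $\mu_t$ and $\sigma_t$ that yields the location attention $\pmb{\lambda}_t$, and a convex combination of content and location attention that yields $\pmb{\alpha}_t$. The following is a minimal sketch of those two steps only, not the authors' implementation. The function names, the use of integer source positions, the normalization of the Gaussian values to sum to 1, and the scalar `mix_weight` are all assumptions made for illustration; the Weighter and the building blocks $\mathbf{b}_t$ that produce $\mu_t$ and $\sigma_t$ are not modelled.

```python
import math


def gaussian_location_attention(mu, sigma, n_positions):
    """Location weights from a Gaussian centred at `mu` with std `sigma`.

    Assumption: positions are the integers 0..n_positions-1 and the weights
    are renormalized to sum to 1 (the PDF's constant factor then cancels).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Log of the unnormalized Gaussian density at each position.
    log_w = [-0.5 * ((i - mu) / sigma) ** 2 for i in range(n_positions)]
    # Subtract the maximum before exponentiating to avoid underflow to all zeros.
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def mix_attention(content, location, mix_weight):
    """Convex combination of content and location attention.

    Assumption: `mix_weight` in [0, 1] is the share given to location attention.
    """
    if not 0.0 <= mix_weight <= 1.0:
        raise ValueError("mix_weight must lie in [0, 1]")
    if len(content) != len(location):
        raise ValueError("attention vectors must have the same length")
    return [mix_weight * l + (1.0 - mix_weight) * c
            for c, l in zip(content, location)]


# Hypothetical usage: five source positions, location attention centred on index 2.
location = gaussian_location_attention(mu=2.0, sigma=0.5, n_positions=5)
content = [0.2] * 5  # uniform content attention, purely illustrative
alpha = mix_attention(content, location, mix_weight=0.7)
```

Because both inputs sum to 1 and the mixing weight lies in [0, 1], the mixed attention is again a distribution over source positions. A small `sigma` concentrates the location attention on the position nearest `mu`, while a large `sigma` spreads it out.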