Translation Equivariant Transformer Neural Processes

Matthew Ashman, Cristiana Diaconu, Junhyuck Kim, Lakee Sivaraya, Stratis Markou, James Requeima, Wessel P. Bruinsma, Richard E. Turner

TL;DR

This paper tackles learning translation-equivariant posterior predictions for data that are roughly stationary in space or time. It introduces Translation Equivariant Transformer Neural Processes (TE-TNPs) by replacing standard attention with translation-equivariant attention and by incorporating pseudo-tokens to lower computational complexity to $\mathcal{O}(MN)$ per layer. Theoretical results connect stationarity with translation equivariance of the predictive map and demonstrate improved generalisation under distribution shifts within the model's receptive field. Empirically, TE-TNPs outperform non-TE baselines and competitive NP variants across synthetic 1-D regression, image completion, Kolmogorov flow, and ERA5 environmental data, highlighting improved robustness to translations and strong spatio-temporal modelling capabilities.
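
To make the idea of translation-equivariant attention concrete, here is a minimal NumPy sketch. It assumes one simple construction: attention logits that see the input locations only through pairwise differences $x_i - x_j$, so shifting every location by the same amount leaves the weights unchanged. The function name `te_attention_weights`, the projection matrices `w_q` and `w_k`, and the fixed function `rho` (a stand-in for a learned network) are all hypothetical and are not taken from the paper, whose exact layer may differ.

```python
import numpy as np

def te_attention_weights(z, x, w_q, w_k, rho):
    """Attention weights that depend on locations only through x_i - x_j."""
    q = z @ w_q                                    # (N, d) queries from tokens
    k = z @ w_k                                    # (N, d) keys from tokens
    dots = q @ k.T / np.sqrt(q.shape[-1])          # (N, N) token similarity
    diffs = x[:, None, :] - x[None, :, :]          # (N, N, Dx) pairwise differences
    logits = dots + rho(diffs)                     # location term is shift invariant
    logits -= logits.max(axis=-1, keepdims=True)   # stabilise the softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N, Dz, Dx = 5, 8, 1
z = rng.normal(size=(N, Dz))     # token representations
x = rng.normal(size=(N, Dx))     # input locations
w_q = rng.normal(size=(Dz, Dz))
w_k = rng.normal(size=(Dz, Dz))

# Hypothetical stand-in for a learned function of the location differences.
rho = lambda d: -np.sum(d ** 2, axis=-1)

a = te_attention_weights(z, x, w_q, w_k, rho)
a_shifted = te_attention_weights(z, x + 3.0, w_q, w_k, rho)
shift_invariant = np.allclose(a, a_shifted)
```

In this sketch the full attention matrix is $N \times N$. With $M$ pseudo-tokens attending to $N$ datapoints, the corresponding matrix would be $M \times N$, which is where the $\mathcal{O}(MN)$ per-layer cost comes from.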

Abstract

The effectiveness of neural processes (NPs) in modelling posterior prediction maps -- the mapping from data to posterior predictive distributions -- has significantly improved since their inception. This improvement can be attributed to two principal factors: (1) advancements in the architecture of permutation invariant set functions, which are intrinsic to all NPs; and (2) leveraging symmetries present in the true posterior predictive map, which are problem dependent. Transformers are a notable development in permutation invariant set functions, and their utility within NPs has been demonstrated through the family of models we refer to as TNPs. Despite significant interest in TNPs, little attention has been given to incorporating symmetries. Notably, the posterior prediction maps for data that are stationary -- a common assumption in spatio-temporal modelling -- exhibit translation equivariance. In this paper, we introduce a new family of translation equivariant TNPs (TE-TNPs). Through an extensive range of experiments on synthetic and real-world spatio-temporal data, we demonstrate the effectiveness of TE-TNPs relative to their non-translation-equivariant counterparts and other NP baselines.

Paper Structure

This paper contains 36 sections, 2 theorems, 39 equations, 9 figures, 7 tables.

Key Result

Theorem 2.1

(1) The ground-truth stochastic process $P$ is stationary and $\pi'_P$ is translation invariant if and only if (2) $\pi_P$ is translation equivariant.
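
One direction of this statement can be checked numerically on a toy case. The sketch below is an illustration only and is not the paper's model: it assumes a zero-mean Gaussian process with a stationary RBF kernel as the ground-truth process, and verifies that shifting the context and target inputs by the same amount $\tau$ leaves the posterior predictive mean and variance unchanged. The helper names `rbf` and `gp_predict`, and the lengthscale and noise values, are hypothetical choices.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    """Stationary kernel: depends on the inputs only through a - b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def gp_predict(xc, yc, xt, noise=0.1):
    """Posterior predictive mean and variance of a zero-mean GP at xt."""
    k_cc = rbf(xc, xc) + noise ** 2 * np.eye(len(xc))   # context covariance
    k_tc = rbf(xt, xc)                                  # target-context covariance
    mean = k_tc @ np.linalg.solve(k_cc, yc)
    # Prior variance is 1 for this kernel; add observation noise, subtract the
    # variance explained by the context set.
    explained = np.sum(k_tc * np.linalg.solve(k_cc, k_tc.T).T, axis=-1)
    var = 1.0 + noise ** 2 - explained
    return mean, var

rng = np.random.default_rng(0)
xc = rng.uniform(-2, 2, size=10)   # context inputs
yc = np.sin(xc)                    # context outputs
xt = np.linspace(-2, 2, 7)         # target inputs
tau = 5.0                          # translation applied to all inputs

mean, var = gp_predict(xc, yc, xt)
mean_s, var_s = gp_predict(xc + tau, yc, xt + tau)
is_equivariant = np.allclose(mean, mean_s) and np.allclose(var, var_s)
```

Replacing `rbf` with a non-stationary kernel, for example a linear kernel `a[:, None] * b[None, :]`, would generally break this equality, which is the intuition behind linking stationarity to translation equivariance.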

Figures (9)

  • Figure 1: Block diagrams illustrating the TNP and TE-TNP encoder architectures. For both models, we pass individual datapoints through pointwise MLPs to obtain the initial token representations, $\mathbf{Z}^0_c$ and $\mathbf{Z}^0_t$. These are then passed through multiple attention layers, with the context tokens interacting with the target tokens through cross-attention. The output of the encoder depends on $\mathcal{D}_c$ and $\mathbf{X}_t$. The TE-TNP encoder updates the input locations at each layer, in addition to the tokens.
  • Figure 2: Average test log-likelihood ($\uparrow$) for Kolmogorov flow. Standard errors are shown.
  • Figure 3: Translation equivariance in combination with a limited receptive field (see (a)) can help generalisation performance. Consider a translation equivariant (TE) model which performs well within a training range (see (b)). Consider a prediction for a target input outside the training range (right triangle in (b)). If the model has receptive field $R > 0$ and the training range is bigger than $R$, then TE can be used to "shift that prediction back into the training range" (see (b)). Since the model performs well within the training range, the model also performs well for the target input outside the training range.
  • Figure 4: A comparison between the predictive distributions on a single synthetic-1D regression dataset of the , , and when the data is shifted by amount $\Delta = 0$ (top) and $\Delta = 2$ (bottom). Observe that the and models exhibit translation equivariance, whereas the and models do not. Context points are shown in black, and the ground-truth predictive mean and $\pm$ standard deviation are shown in dashed-purple.
  • Figure 5: A comparison between the vorticity at a single point in time, computed using the predicted velocities for a single test Kolmogorov flow dataset. Here, the proportion of datapoints in the context set is 10%.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Theorem 2.1
  • Theorem 2.2
  • Proof of Theorem 2.1
  • Proof of Theorem 2.2