Gridded Transformer Neural Processes for Large Unstructured Spatio-Temporal Data

Matthew Ashman, Cristiana Diaconu, Eric Langezaal, Adrian Weller, Richard E. Turner

TL;DR

We introduce gridded pseudo-token TNPs, which use specialised encoders and decoders to handle unstructured observations, together with a processor of gridded pseudo-tokens that leverages efficient attention mechanisms. The approach has the potential to bring performance and computational benefits when applied at scale in a weather modelling pipeline.

Abstract

Many important problems require modelling large-scale spatio-temporal datasets, with one prevalent example being weather forecasting. Recently, transformer-based approaches have shown great promise in a range of weather forecasting problems. However, these have mostly focused on gridded data sources, neglecting the wealth of unstructured, off-the-grid data from observational measurements such as those at weather stations. A promising family of models suitable for such tasks are neural processes (NPs), notably the family of transformer neural processes (TNPs). Although TNPs have shown promise on small spatio-temporal datasets, they are unable to scale to the quantities of data used by state-of-the-art weather and climate models. This limitation stems from their lack of efficient attention mechanisms. We address this shortcoming through the introduction of gridded pseudo-token TNPs which employ specialised encoders and decoders to handle unstructured observations and utilise a processor containing gridded pseudo-tokens that leverage efficient attention mechanisms. Our method consistently outperforms a range of strong baselines on various synthetic and real-world regression tasks involving large-scale data, while maintaining competitive computational efficiency. The real-life experiments are performed on weather data, demonstrating the potential of our approach to bring performance and computational benefits when applied at scale in a weather modelling pipeline.

Paper Structure

This paper contains 66 sections, 16 equations, 20 figures, 6 tables.

Figures (20)

  • Figure 1: A unifying construction of CNPs, with $\mathcal{D}_c = \{(\mathbf{x}_{c, n}, \mathbf{y}_{c, n})\}_{n}$ and $\mathbf{z}_{c, n} = e(\mathbf{x}_{c, n}, \mathbf{y}_{c, n})$.
  • Figure 2: An illustrative demonstration of the complete gridded TNP pipeline. Following the CNP construction in \ref{subsec:unifying-construction}, we highlight the encoder (blue), processor (red) and decoder (green).
  • Figure 3: An illustrative demonstration of the pseudo-token grid encoder in the 2-D case. To achieve an efficient implementation of cross-attention to the pseudo-token grid, we pad sets of neighbourhood tokens with 'dummy' tokens, so that each neighbourhood has the same cardinality.
  • Figure 4: Plots comparing the test log-likelihood vs. forward pass time (FPT) for the two synthetic GP datasets. For each model, we show the results for a large and small (transparent) version. The baselines have hatched markers. The grid sizes we consider are $64\times 64$ and $32\times 32$, shown as $64$ and $32$. For the ViTNP models, we include results with and without patch encoding, the former indicated by the $\rightarrow$ symbol between the pre- and post-patch-encoded grid sizes. We use the following acronyms. KI: kernel-interpolation grid encoding. PT: pseudo-token grid encoding.
  • Figure 5: A comparison between the predictive error of the 2m temperature at the US weather station locations at 15:00, 28-01-2019. Stations included in the context dataset are shown as black crosses ($\approx 3\%$ of station locations). The Swin-TNP uses the PT-GE. Both the Swin-TNP and ConvCNP use a grid size of $64\times 128$. The PT-TNP uses $M=256$ pseudo-tokens. The mean log-likelihoods of this sample for the three models are $1.611$, $1.351$, and $1.271$, respectively.
  • ...and 15 more figures
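
The padding idea in the Figure 3 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that each off-grid token is assigned to the grid cell containing it, that dummy tokens are zero vectors hidden by a boolean mask, and that a pseudo-token with an empty neighbourhood receives a zero update. The function names `pad_neighbourhoods` and `masked_cross_attention` are hypothetical, and the attention is single-head with no learned projections.

```python
import numpy as np


def pad_neighbourhoods(x, z, grid_shape, lo=0.0, hi=1.0):
    """Group tokens by grid cell and pad every group to the same size.

    x: (N, 2) locations in [lo, hi]^2; z: (N, D) token embeddings.
    Returns padded tokens (H*W, K, D) and a mask (H*W, K), True for real tokens.
    """
    H, W = grid_shape
    n, d = z.shape
    # Assumed assignment rule: the grid cell that contains each location.
    scaled = (x - lo) / (hi - lo)
    rows = np.clip((scaled[:, 0] * H).astype(int), 0, H - 1)
    cols = np.clip((scaled[:, 1] * W).astype(int), 0, W - 1)
    cell = rows * W + cols

    counts = np.bincount(cell, minlength=H * W)
    K = max(int(counts.max()), 1)  # common (padded) neighbourhood size

    # Sort tokens by cell, then compute each token's slot within its cell.
    order = np.argsort(cell, kind="stable")
    sorted_cell = cell[order]
    starts = np.cumsum(counts) - counts
    slot = np.arange(n) - starts[sorted_cell]

    padded = np.zeros((H * W, K, d), dtype=z.dtype)  # zeros act as dummy tokens
    mask = np.zeros((H * W, K), dtype=bool)
    padded[sorted_cell, slot] = z[order]
    mask[sorted_cell, slot] = True
    return padded, mask


def masked_cross_attention(q, padded, mask):
    """Each pseudo-token query attends only to the real tokens in its cell.

    q: (M, D) pseudo-token queries; padded: (M, K, D); mask: (M, K).
    """
    d = q.shape[-1]
    scores = np.einsum("md,mkd->mk", q, padded) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    # Rows with no real tokens would be all -inf; neutralise them first.
    has_any = mask.any(axis=1, keepdims=True)
    scores = np.where(has_any, scores, 0.0)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores) * mask  # dummy tokens get exactly zero weight
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return np.einsum("mk,mkd->md", weights, padded)
```

Because every neighbourhood has the same cardinality K after padding, the cross-attention reduces to one batched contraction over a dense (M, K, D) array rather than a loop over variable-sized sets. The cost is that K is set by the most crowded cell.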