Taylorformer: Probabilistic Modelling for Random Processes including Time Series

Omer Nivron; Raghul Parthipan; Damon J. Wischik

TL;DR

The Taylorformer approximates a consistent stochastic process, provides uncertainty-aware predictions, and improves MSE by at least 14% on forecasting tasks, including electricity, oil temperatures and exchange rates.

Abstract

We propose the Taylorformer for random processes such as time series. Its two key components are: 1) the LocalTaylor wrapper, which adapts Taylor approximations (used in dynamical systems) for use in neural network-based probabilistic models, and 2) the MHA-X attention block, which makes predictions in a way inspired by how Gaussian Processes' mean predictions are linear smoothings of contextual data. Taylorformer outperforms the state-of-the-art in terms of log-likelihood on 5/6 classic Neural Process tasks such as meta-learning 1D functions, and has at least a 14% MSE improvement on forecasting tasks, including electricity, oil temperatures and exchange rates. Taylorformer approximates a consistent stochastic process and provides uncertainty-aware predictions. Our code is provided in the supplementary material.
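
The remark that a Gaussian Process mean prediction is a linear smoothing of the contextual data can be illustrated with a small NumPy sketch. It is not the Taylorformer or MHA-X implementation: the RBF kernel, the `lengthscale` and `noise` values, and the function names `rbf_kernel` and `gp_smoothing_weights` are all assumptions chosen for illustration. The point it shows is that the smoothing weights depend only on the input locations, and the predicted mean is those weights applied linearly to the context values.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel between two 1D arrays of inputs (assumed kernel).
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_smoothing_weights(x_context, x_target, noise=1e-2, lengthscale=1.0):
    # Weights W such that the GP posterior mean is W @ y_context.
    # W = K_tc (K_cc + noise * I)^{-1}; it depends on the x values only.
    k_cc = rbf_kernel(x_context, x_context, lengthscale)
    k_tc = rbf_kernel(x_target, x_context, lengthscale)
    a = k_cc + noise * np.eye(len(x_context))
    # a is symmetric, so solving a @ Z = k_tc.T and transposing gives k_tc @ inv(a).
    return np.linalg.solve(a, k_tc.T).T

# Hypothetical toy data: three context points, two target locations.
x_context = np.array([-1.0, 0.0, 1.0])
y_context = np.sin(x_context)
x_target = np.array([-0.5, 0.5])

W = gp_smoothing_weights(x_context, x_target)
mean = W @ y_context  # each target mean is a weighted sum of context values
print(W)
print(mean)
```

Each row of `W` plays a role loosely analogous to a row of attention weights computed from X features alone, with the context Y values entering only through the final weighted sum.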

Paper Structure (44 sections, 17 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Problem setup.
Our Contributions.
Related Work
Neural Processes and Consistency.
Forecasting.
Our approach: Taylorformer
LocalTaylor
MHA-X block
Training
MHA-XY
MHA-X
Experiments
Neural Process tasks
Datasets
...and 29 more sections

Figures (6)

Figure 1: Taylorformer (left) can generate higher-quality samples on the ETT dataset [Wu2021AutoformerDT] than the state-of-the-art Autoformer [Wu2021AutoformerDT] (right). This is representative of the general difference between these models at generation time. The task is to predict the next 48 hours (192 target points) given a 24-hour window (96 context points).
Figure 2: a) Taylorformer architecture corresponding to equation \ref{eq:our_approach}. LocalTaylor is a wrapper around the central neural network, MHA-X-Net. The channels on the right-hand side are MHA-X. The ones on the left are MHA-XY. The noted features are shown in the following equations: XY features with masking, eq. \ref{eq:query_xy}; XY features, eq. \ref{eq:value_xy}; and X features, eq. \ref{eq:query_x}. b) Example mask for $n_C = 3$ and $n_T = 3$. Each token can attend to other shaded tokens in its row.
Figure 3: By shuffling the target variables given the context during training, we drive the model to be approximately target equivariant. If all log-likelihood scores are equal, their standard deviation will be zero. The histograms show these standard deviations for training with (a) one permuted sample and (b) five permuted samples on the RBF task. We can see that the models are 'close' to consistency. Furthermore, for the specific RBF task, one permuted sample during training seems suitable to drive consistency.
Figure 4: Validation set negative log-likelihood (NLL). Lower is better. Our Taylorformer outperforms NP [Garnelo2018NeuralP], ANP [Kim2019AttentiveNP] and TNP [pmlr-v162-nguyen22b] on the meta-learning 1D regression task (see task details in the main text).
Figure 5: Ablations for our model showing that using the LocalTaylor wrapper and the MHA-X block together (green line) contributes to the improved results. This is shown for a 1D regression task (GP RBF kernel).
...and 1 more figure
