Rough Transformers for Continuous and Efficient Time-Series Modelling

Fernando Moreno-Pino, Álvaro Arroyo, Harrison Waldon, Xiaowen Dong, Álvaro Cartea

TL;DR

We introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences at significantly lower computational cost, which is critical for addressing the long-range dependencies common in medical contexts.

Abstract

Time-series data in real-world medical settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In such contexts, traditional sequence-based recurrent models struggle. To overcome this, researchers replace recurrent architectures with Neural ODE-based models to handle irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of moderate length or longer. To mitigate this, we introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences at significantly lower computational cost, which is critical for addressing the long-range dependencies common in medical contexts. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global dependencies in input data, while remaining robust to changes in the sequence length and sampling frequency. We find that Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the benefits of Neural ODE-based models using a fraction of the computational time and memory resources on synthetic and real-world time-series tasks.

Paper Structure

This paper contains 13 sections, 4 theorems, 13 equations, 1 figure, and 5 tables.

Key Result

Proposition A.1

Given a smooth path $\widehat{X}: [0, T] \rightarrow \mathbb{R}^d$, the map $P_{\widehat{X}}: [0, T] \rightarrow \mathbb{R}^{1+d}$, where $P_{\widehat{X}}(t) = (t, \widehat{X}(t))$, is uniquely determined by its signature $S(P_{\widehat{X}})_{0, T}$.
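To make the time-augmented map $P_{\widehat{X}}(t) = (t, \widehat{X}(t))$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the path is the piecewise-linear interpolation of the observations and truncates the signature at depth 2. The helper names `time_augment` and `signature_depth2` are hypothetical. The example compares two paths with the same endpoints, which agree at depth 1 and differ at depth 2, and checks that inserting an extra sample on an existing segment leaves the truncated signature unchanged.

```python
import numpy as np

def time_augment(times, values):
    """Stack time as the first channel, giving samples of P(t) = (t, X(t))."""
    times = np.asarray(times, dtype=float).reshape(-1, 1)
    values = np.asarray(values, dtype=float).reshape(len(times), -1)
    return np.hstack([times, values])

def signature_depth2(path):
    """Depth-2 signature of the piecewise-linear path through the rows of `path`."""
    channels = path.shape[1]
    level1 = np.zeros(channels)
    level2 = np.zeros((channels, channels))
    for dx in np.diff(path, axis=0):
        # A linear segment has signature (dx, dx (x) dx / 2); append it to the running total.
        level2 += np.outer(level1, dx) + 0.5 * np.outer(dx, dx)
        level1 += dx
    return level1, level2

# Two paths that start and end at 0 but move in opposite directions.
times = [0.0, 0.5, 1.0]
up_l1, up_l2 = signature_depth2(time_augment(times, [0.0, 1.0, 0.0]))
down_l1, down_l2 = signature_depth2(time_augment(times, [0.0, -1.0, 0.0]))

# The same "up" path with one extra sample lying on its first segment.
dense_l1, dense_l2 = signature_depth2(
    time_augment([0.0, 0.25, 0.5, 1.0], [0.0, 0.5, 1.0, 0.0])
)

print("level 1 equal:", np.allclose(up_l1, down_l1))
print("level 2 equal:", np.allclose(up_l2, down_l2))
print("resampling changes signature:", not np.allclose(up_l2, dense_l2))
```

Depth 1 only records the total increment, so it cannot separate the two paths. The cross term between the time channel and the value channel at depth 2 does separate them, which illustrates on a small scale why the time channel helps the signature identify the path.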

Figures (1)

  • Figure 1: Test accuracy per epoch for the frequency classification task across three random seeds. Left: Performance for the full time-series. Right: Performance when randomly dropping half of the datapoints every epoch.

Theorems & Definitions (4)

  • Proposition A.1
  • Theorem A.2
  • Proposition A.3
  • Proposition A.4: Chen's Relation
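The statement of Proposition A.4 is not reproduced on this page. In its standard form, Chen's relation says that the signature over $[s, t]$ is the tensor product of the signatures over $[s, u]$ and $[u, t]$, that is, $S(X)_{s,t} = S(X)_{s,u} \otimes S(X)_{u,t}$. The sketch below checks this numerically at depth 2 for a piecewise-linear path. It is an illustration under those assumptions rather than the paper's code, and the names `signature_depth2` and `chen_product` are hypothetical.

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path, computed directly from its increments."""
    dx = np.diff(np.asarray(path, dtype=float), axis=0)
    # Row i of `before` holds the sum of all increments strictly before step i.
    before = np.cumsum(dx, axis=0) - dx
    level1 = dx.sum(axis=0)
    level2 = before.T @ dx + 0.5 * dx.T @ dx
    return level1, level2

def chen_product(sig_a, sig_b):
    """Tensor product of two signatures truncated at depth 2."""
    a1, a2 = sig_a
    b1, b2 = sig_b
    return a1 + b1, a2 + np.outer(a1, b1) + b2

rng = np.random.default_rng(0)
path = np.cumsum(rng.normal(size=(50, 3)), axis=0)  # a random 3-channel path
split = 20

whole = signature_depth2(path)
left = signature_depth2(path[: split + 1])   # samples 0..split
right = signature_depth2(path[split:])       # samples split..end, sharing the split point
combined = chen_product(left, right)

print("levels match:", np.allclose(whole[0], combined[0]), np.allclose(whole[1], combined[1]))
```

Because signatures of sub-intervals compose in this way, the signature of a long window can be assembled from those of shorter windows without revisiting the raw samples.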