Rough Transformers: Lightweight and Continuous Time Series Modelling through Signature Patching

Fernando Moreno-Pino; Álvaro Arroyo; Harrison Waldon; Xiaowen Dong; Álvaro Cartea

TL;DR

Rough Transformers address the challenge of modelling long, irregularly sampled time series by lifting discrete inputs to continuous-time representations using path signatures and a novel multi-view attention mechanism. This approach reduces the quadratic attention cost and makes the model robust to irregular sampling, while also improving spatial processing across channels. Empirical results show that Rough Transformers (RFormer) outperform vanilla Transformers and several continuous-time baselines on synthetic and real long-sequence tasks, with substantial training-time speedups. The work offers a practical, scalable route to continuous-time sequence modelling with gains in both efficiency and accuracy, which broadens its applicability to domains with irregularly sampled data.

Abstract

Time-series data in real-world settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In these settings, traditional sequence-based recurrent models struggle. To overcome this, researchers often replace recurrent architectures with Neural ODE-based models to account for irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of even moderate length. To address this challenge, we introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences and incurs significantly lower computational costs. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global (multi-scale) dependencies in the input data, while remaining robust to changes in the sequence length and sampling frequency and yielding improved spatial processing. We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the representational benefits of Neural ODE-based models, all at a fraction of the computational time and memory resources.
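To make the idea of a multi-view signature more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a signature truncated at depth 2, a time-augmented path, and windows chosen by sample index; the function names `signature_depth2` and `multi_view_signature` are hypothetical. For each window it concatenates a global signature (from the start of the path to the end of the window) with a local signature (over the window only), both computed on the linear interpolation of the samples.

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of the linear interpolation of `path` (shape [n, d]).

    Returns a vector of length d + d*d: level 1 followed by flattened level 2.
    """
    path = np.asarray(path, dtype=float)
    increments = np.diff(path, axis=0)        # segment increments, shape [n-1, d]
    level1 = increments.sum(axis=0)           # total displacement
    prefix = path[:-1] - path[0]              # displacement before each segment
    # Iterated integral over each linear segment, summed over segments:
    # sum_k prefix_k (x) delta_k + 0.5 * delta_k (x) delta_k
    level2 = prefix.T @ increments + 0.5 * increments.T @ increments
    return np.concatenate([level1, level2.ravel()])

def multi_view_signature(times, values, num_windows):
    """Stack [global, local] depth-2 signatures for `num_windows` windows.

    Assumptions (illustrative only): time is prepended as an extra channel
    and windows are equally spaced in sample index.
    """
    path = np.column_stack([times, values])   # time-augmented path, shape [n, c+1]
    edges = np.linspace(0, len(path) - 1, num_windows + 1).round().astype(int)
    features = []
    for start, end in zip(edges[:-1], edges[1:]):
        global_sig = signature_depth2(path[: end + 1])        # from the origin
        local_sig = signature_depth2(path[start : end + 1])   # window only
        features.append(np.concatenate([global_sig, local_sig]))
    return np.stack(features)                 # shape [num_windows, 2 * (d + d*d)]

# Irregularly sampled sinusoid: 100 samples compressed into 4 feature vectors.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 1.0, size=100))
x = np.sin(2 * np.pi * t)[:, None]
tokens = multi_view_signature(t, x, num_windows=4)
```

In this sketch the attention layer would operate on `tokens` (here 4 rows) rather than on the 100 raw samples, which is where the reduction in attention cost would come from. Because the signature is computed on the linear interpolation, adding extra sample points along an existing segment leaves `signature_depth2` unchanged, which gives some intuition for the robustness to sampling frequency described above.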

Paper Structure (35 sections, 5 theorems, 22 equations, 9 figures, 24 tables)

Introduction
Background and Methodology
Problem Formulation.
Sequence Modelling with Transformers.
Rough Path Signatures.
Rough Transformers
Advantages of Rough Transformers
Experiments
Time Series Processing
Frequency Classification.
Training Efficiency
Irregular Time Series Classification
Reasons for improved model performance
Spatial Processing
Sequence Coarsening as an Inductive Bias for Transformers
...and 20 more sections

Key Result

Proposition 3.1

Let $\mathbb{T}$ be a Rough Transformer. Suppose $\widehat{X}: [0, T] \rightarrow \mathbb{R}^d$ is a continuous-time process, and let $\gamma : [0, T] \rightarrow [0, T]$ denote a time-reparameterization. Suppose $\mathbf{X}$ and $\mathbf{X}'$ are samplings of $\widehat{X}$ and $\widehat{X}\circ \ga

Figures (9)

Figure 1: A representation of the multi-view signature. The continuous-time path is irregularly sampled at points marked with a red $x$. The local and global signatures of a linear interpolation of these points are computed and concatenated to form the multi-view signature. The multi-view signature transform consists of $\overline{L}$ multi-view signatures.
Figure 2: Seconds per epoch for growing input length and for different model types on the sinusoidal dataset. Left: Log Scale. Middle: Regular Scale. Right: Log-log scale. When a line stops, it indicates an OOM error.
Figure 3: Test accuracy per epoch for the frequency classification task across three random seeds. Left: Sinusoidal dataset. Right: Long Sinusoidal dataset.
Figure 4: Average performance of all models on the 15 univariate datasets from the UEA Time Series archive under different degrees of data drop.
Figure 5: Left: Graph connectivity structures for multivariate, univariate and sparse signature. Middle: Example samples for synthetic task. Right: Performance on spatial synthetic experiment.
...and 4 more figures

Theorems & Definitions (6)

Proposition 3.1
proof
Proposition A.1
Theorem A.2
Proposition A.3
Proposition A.4: Chen's Relation
