Test Time Learning for Time Series Forecasting

Panayiotis Christou; Shichu Chen; Xupeng Chen; Parijat Dube

TL;DR

The paper tackles long-term time series forecasting (LTSF) under non-stationarity and long-range dependencies, where Transformer-based methods incur quadratic computational cost and state-space models still leave room for improvement in accuracy and scalability. It introduces Test-Time Training (TTT) modules embedded in a TimeMachine backbone, featuring a two-level hierarchical embedding, convolutional context submodules, and two channel-processing modes (channel mixing and channel independence) to enable dynamic adaptation with complexity linear in sequence length. Across seven benchmark datasets, TimeMachine-TTT variants outperform state-of-the-art models, particularly on longer horizons and larger datasets, with Conv Stack 5 often delivering the strongest results. The work also presents extensive ablations, experiments with longer sequence and prediction lengths, and complexity analyses, positioning TTT as a scalable and effective direction for high-performance LTSF and as a foundation for future architectural exploration.

Abstract

Time-series forecasting has seen significant advancements with the introduction of token prediction mechanisms such as multi-head attention. However, these methods often struggle to achieve the same performance as in language modeling, primarily due to the quadratic computational cost and the complexity of capturing long-range dependencies in time-series data. State-space models (SSMs), such as Mamba, have shown promise in addressing these challenges by offering efficient solutions with linear RNNs capable of modeling long sequences with larger context windows. However, there remains room for improvement in accuracy and scalability. We propose the use of Test-Time Training (TTT) modules in a parallel architecture to enhance performance in long-term time series forecasting. Through extensive experiments on standard benchmark datasets, we demonstrate that TTT modules consistently outperform state-of-the-art models, including the Mamba-based TimeMachine, particularly in scenarios involving extended sequence and prediction lengths. Our results show significant improvements in Mean Squared Error (MSE) and Mean Absolute Error (MAE), especially on larger datasets such as Electricity, Traffic, and Weather, underscoring the effectiveness of TTT in capturing long-range dependencies. Additionally, we explore various convolutional architectures within the TTT framework, showing that even simple configurations like 1D convolution with small filters can achieve competitive results. This work sets a new benchmark for time-series forecasting and lays the groundwork for future research in scalable, high-performance forecasting models.
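The abstract does not spell out how a TTT module adapts at inference time. The sketch below assumes the usual description of a TTT layer: the hidden state is the weight matrix of a small linear model, and each incoming token triggers one gradient step on a self-supervised reconstruction loss. The function names, the learning rate, and the simple damping used as the "training view" are all hypothetical choices for illustration, not the paper's implementation.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def ttt_linear_scan(tokens, lr=0.1):
    """Scan a sequence once, updating the hidden weights W at every token.

    Assumed self-supervised task (hypothetical): reconstruct the token x
    from a damped view x_in = 0.5 * x, with loss 0.5 * ||W x_in - x||^2.
    """
    d = len(tokens[0])
    W = [[0.0] * d for _ in range(d)]  # hidden state = model weights
    outputs = []
    for x in tokens:
        x_in = [0.5 * v for v in x]  # assumed "training view" of the token
        err = [p - t for p, t in zip(matvec(W, x_in), x)]
        # Gradient of the loss w.r.t. W is the outer product err * x_in^T.
        for i in range(d):
            for j in range(d):
                W[i][j] -= lr * err[i] * x_in[j]
        # The output uses the freshly updated weights.
        outputs.append(matvec(W, x_in))
    return outputs, W


# Usage: feeding the same token repeatedly lets W adapt to it.
token = [1.0, 2.0]
outs, W_final = ttt_linear_scan([token] * 50)
```

Each token costs one fixed-size update, so the scan is linear in sequence length, which matches the complexity the paper attributes to TTT modules.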

Paper Structure (74 sections, 37 equations, 5 figures, 7 tables)


Introduction
Related Work
  Transformers for LTSF
  State Space Models for LTSF
  Linear RNNs for LTSF
  MLPs and CNNs for LTSF
Model Architecture
  General Architecture
  Hierarchical Embedding
  Two Level Contextual Cue Modeling
  Final Prediction
  Channel Mixing and Independence Modes
Experiments and Evaluation
  Original Experimental Setup
  Quantitative Results
...and 59 more sections

Figures (5)

Figure 1: TimeMachine incorporating TTT-Blocks
Figure 2: Channel Mixing Mode
Figure 3: Channel Independence Mode
Figure 5: Average MSE and MAE comparison of our model and SOTA baselines with L = 720. The circle center represents the maximum possible error; points closer to the boundary indicate better performance.
Figure 6: Convolutional Hidden Layer Added to the Beginning of the TTT Block. This basic residual building block is similar to the one used in Transformer models. We use the Hidden Layer as part of an ablation study to evaluate the effects of different hidden layer architectures on model performance. The five configurations are detailed below: (1) 1D Convolution with kernel size 3. (2) 1D Convolution with kernel size 5. (3) Two 1D Convolutions with kernel sizes 5 and 3 in cascade. (4) Two 1D Convolutions with kernel size 3 in cascade. (5) An Inception Block combining 1D Convolutions with kernel sizes 5 and 3, followed by concatenation and reduction to the original size. The Sequence Modeling Block of TTT can be used with two different backbones: the Mamba Backbone and the Transformer Backbone.
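The five hidden-layer configurations in the Figure 6 caption can be sketched on a single-channel sequence as follows. This is a minimal illustration under stated assumptions: the kernel weights K3 and K5 are hypothetical fixed smoothing filters (the paper's layers are learned), zero padding is assumed so the output keeps the input length, and the Inception block's "concatenation and reduction" is approximated by a fixed average of the two branches.

```python
def conv1d_same(x, kernel):
    """1D convolution with zero padding so the output length equals len(x)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]


def inception(x, k5, k3):
    """Run kernel-5 and kernel-3 branches in parallel, then reduce to one
    channel. The 50/50 average stands in for concatenation + reduction."""
    a = conv1d_same(x, k5)
    b = conv1d_same(x, k3)
    return [0.5 * u + 0.5 * v for u, v in zip(a, b)]


# Hypothetical fixed kernels; a real model would learn these weights.
K3 = [0.25, 0.5, 0.25]
K5 = [0.125, 0.25, 0.25, 0.25, 0.125]

# Keys are illustrative labels for configurations (1) to (5) in the caption.
CONFIGS = {
    "conv3": lambda x: conv1d_same(x, K3),
    "conv5": lambda x: conv1d_same(x, K5),
    "conv_stack_5_3": lambda x: conv1d_same(conv1d_same(x, K5), K3),
    "conv_stack_3_3": lambda x: conv1d_same(conv1d_same(x, K3), K3),
    "inception": lambda x: inception(x, K5, K3),
}

# Usage: apply every configuration to a unit impulse.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
responses = {name: f(impulse) for name, f in CONFIGS.items()}
```

Because every configuration preserves the sequence length, any of them can be placed in front of the sequence modeling block without changing its input shape, which is what makes the ablation a drop-in comparison.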
