Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

Kashif Rasul; Arjun Ashok; Andrew Robert Williams; Hena Ghonia; Rishika Bhagwatkar; Arian Khorasani; Mohammad Javad Darvishi Bayazi; George Adamopoulos; Roland Riachi; Nadhir Hassen; Marin Biloš; Sahil Garg; Anderson Schneider; Nicolas Chapados; Alexandre Drouin; Valentina Zantedeschi; Yuriy Nevmyvaka; Irina Rish

TL;DR

Lag-Llama introduces a decoder-only transformer for univariate probabilistic time series forecasting, trained on a large, diverse corpus to enable zero-shot generalization and strong few-shot adaptation across domains. It employs lag-based tokenization, RMSNorm and RoPE-style positional encoding, a simple Student-t distribution head, and robust value scaling, coupled with strategic pretraining and augmentation. Empirical results show competitive zero-shot performance and state-of-the-art or near-state-of-the-art results after fine-tuning across 27 datasets from multiple domains, with few-shot performance improving as more history from the target dataset becomes available. The work supports the viability of foundation-model-style approaches for time series and provides scaling and diversity analyses to guide future research.
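As a rough illustration of two components named above, the sketch below shows one common form of robust value scaling and the negative log-likelihood that a Student-t distribution head would typically be trained on. The summary does not specify the scaling formula, so centering by the median and dividing by the interquartile range is an assumption here. The function names `robust_scale` and `student_t_nll` are hypothetical and not taken from the authors' code.

```python
import math
import statistics


def robust_scale(values):
    """Center by the median and divide by the interquartile range (IQR).

    Assumption: median/IQR scaling; the summary only says "robust value scaling".
    Returns the scaled values plus the median and IQR needed to undo the scaling.
    """
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = (q3 - q1) or 1.0  # fall back to 1.0 for a constant series
    return [(v - median) / iqr for v in values], median, iqr


def student_t_nll(x, df, loc, scale):
    """Negative log-likelihood of x under a Student-t(df, loc, scale) distribution."""
    z = (x - loc) / scale
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
        - math.log(scale)
    )
    return -(log_norm - (df + 1) / 2 * math.log1p(z * z / df))


# Usage: a single outlier (100) barely affects the median and IQR.
scaled, median, iqr = robust_scale([1.0, 2.0, 3.0, 4.0, 100.0])
loss = student_t_nll(scaled[-1], df=3.0, loc=0.0, scale=1.0)
```

In a real model the three Student-t parameters would come from the network's output layer, with degrees of freedom and scale constrained to be positive; here they are passed in directly to keep the example self-contained.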

Abstract

Over the past years, foundation models have caused a paradigm shift in machine learning due to their unprecedented capabilities for zero-shot and few-shot generalization. However, despite the success of foundation models in modalities such as natural language processing and computer vision, the development of foundation models for time series forecasting has lagged behind. We present Lag-Llama, a general-purpose foundation model for univariate probabilistic time series forecasting based on a decoder-only transformer architecture that uses lags as covariates. Lag-Llama is pretrained on a large corpus of diverse time series data from several domains, and demonstrates strong zero-shot generalization capabilities compared to a wide range of forecasting models on downstream datasets across domains. Moreover, when fine-tuned on relatively small fractions of such previously unseen datasets, Lag-Llama achieves state-of-the-art performance, outperforming prior deep learning approaches and emerging as the best general-purpose model on average. Lag-Llama serves as a strong contender to the current state of the art in time series forecasting and paves the way for future advancements in foundation models tailored to time series data.

Paper Structure (32 sections, 5 equations, 13 figures, 9 tables)

Introduction
Related Work
Probabilistic Time Series Forecasting
Lag-Llama
Tokenization: Lag Features
Lag-Llama Architecture
Choice of Distribution Head
Value Scaling
Training Strategies
Experimental Setup
Datasets
Baselines
Hyperparameter Search and Model Training Setups
Inference and Model Evaluation
Results
...and 17 more sections

Figures (13)

Figure 1: For a time series, we depict the tokenization at the timestep $t$ of the value $x_t$ which contains lag features constructed using an example set of lag indices $\mathcal{L}$, where each value in the vector is from the past of $x_t$ (in blue), and $F$ possible temporal covariates (date-time features) constructed from timestamp $t$ (red).
Figure 2: The Lag-Llama architecture. Lag-Llama learns to output a distribution over the values of the next time step based on lagged input features. The input to the model is the token of a univariate time series $i$ at a given timestep, $\mathbf{x}^i_t$, constructed as described in the section "Tokenization: Lag Features". Here, we use $\mathbf{c}_t^i$ to refer to all additional covariates used along with the value at a timestep $t$, which include the $|\mathcal{L}|$ lags, $F$ date-time features, and summary statistics. The inputs are projected through $M$ masked decoder layers. The features are then passed through the distribution head and trained to predict the parameters of the forecast distribution of the next timestep.
Figure 3: Forecasting examples on the Electricity Hourly dataset
Figure 4: Forecasting examples from ETT-H2 dataset
Figure 5: Forecasting examples from Traffic dataset
...and 8 more figures
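
The tokenization described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lag_tokens`, the example lag indices, and the layout of the covariates are all hypothetical, and the sketch leaves out the summary statistics mentioned in the Figure 2 caption.

```python
def lag_tokens(series, lag_indices, covariates=None):
    """Build one token per timestep from lagged values and optional covariates.

    series      : sequence of T values of a univariate time series
    lag_indices : positive lags, e.g. [1, 2, 7] (an example set, not the paper's)
    covariates  : optional sequence of T rows, each holding F date-time features
    Returns a list of tokens for t = max_lag, ..., T - 1; earlier timesteps are
    skipped because their largest lag would fall before the start of the series.
    """
    max_lag = max(lag_indices)
    tokens = []
    for t in range(max_lag, len(series)):
        # Each lag feature is a value from the past of x_t.
        token = [series[t - lag] for lag in lag_indices]
        if covariates is not None:
            # Append the F date-time features computed from timestamp t.
            token.extend(covariates[t])
        tokens.append(token)
    return tokens


# Usage: a toy series 0, 1, ..., 9 with lags 1, 2 and 7.
series = list(range(10))
tokens = lag_tokens(series, [1, 2, 7])
# tokens[0] corresponds to t = 7 and holds x_6, x_5, x_0.
```

One consequence of this construction is that the largest lag determines how much history a series needs before its first token can be formed, so long lags trade usable context for access to distant seasonal patterns.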
