This Time is Different: An Observability Perspective on Time Series Foundation Models

Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, Zongzhe Xu, Viktoriya Zhukova, David Asker, Ameet Talwalkar, Othmane Abou-Amal

TL;DR

This work develops Toto, a decoder-only time-series foundation model engineered for zero-shot forecasting on observability data, featuring per-patch causal normalization, time-variate attention, and a robust Student-T mixture output. It is trained on a large mixed corpus of Datadog telemetry, public datasets, and synthetic data, and is evaluated on a new large-scale observability benchmark, Boom, as well as on established multi-domain benchmarks. Toto achieves state-of-the-art results across Boom, GIFT-Eval, and LSF, with notable improvements in probabilistic calibration and robustness to heavy-tailed distributions. By open-sourcing both Toto and Boom under Apache 2.0, the authors aim to accelerate practical adoption and further research in scalable, domain-specific time-series forecasting for observability workloads.

Abstract

We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto's pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10$\times$ larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog's own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance both on BOOM and on established general-purpose time series forecasting benchmarks. Toto's model weights, inference code, and evaluation scripts, as well as BOOM's data and evaluation code, are all open source under the Apache 2.0 License and available at https://huggingface.co/Datadog/Toto-Open-Base-1.0 and https://github.com/DataDog/toto.

Paper Structure

This paper contains 47 sections, 20 equations, 12 figures, 19 tables.

Figures (12)

  • Figure 1: (A) Toto is a zero-shot time series forecasting model trained on a mixture of observability data, open datasets, and synthetic data. To predict, context time series points are passed through a patch embedding, processed via proportional factorized variate-time attention layers, and projected to a probabilistic output via a learned Student-T mixture model. We sample from this distribution to produce a forecast. Toto's novel architectural components are highlighted in purple. (B) 2D PCA projections of statistical features (described in Section \ref{sec:boom-stats}) of GIFT-Eval [aksu2024giftevalbenchmarkgeneraltime], LSF [Wu2021], and Boom highlight a clear distinction in the underlying time series characteristics of Boom relative to general-purpose time series benchmarks. (C, D) Toto is the top performing model on Boom, the GIFT-Eval public leaderboard [gifteval_leaderboard], and on LSF (see Table \ref{tab:lsf_zero_shot_full}).
  • Figure 2: Overview of the Toto architecture, highlighting our novel components in bold. (A) Multivariate input time series of $L$ steps are scaled using causal patch-based instance normalization, transformed into patch embeddings, and passed through a decoder-only transformer stack. The transformed features are unembedded and passed through a Student-T mixture model (optimized via a composite robust loss) which generates probabilistic next-patch predictions. (B) The patch embedding takes as input a time series of $M$ variates by $L$ time steps. It divides the time dimension into patches of size $P = 64$ and projects these linearly into an embedding space of latent dimension $D = 768$. This results in an output of size $M \times \frac{L}{P} \times D$ which is fed to the transformer decoder. (C) The transformer stack features proportional factorized attention. It contains $F = 1$ identical segment(s), with $N = 11$ time-wise transformer blocks followed by one variate-wise block.
  • Figure 3: (A) A comparison of the number of unique time series points within the pretraining corpora of different time series foundation models. The scale of Toto's training corpus is $4\times$ that of TimesFM 1.0, $5\times$ that of Time-MoE, $6.5\times$ that of Moirai, and over $10\times$ that of Chronos. (B) Ablation results demonstrate the impact of four of Toto's architectural components motivated by unique properties of observability time series data. Results report the change (relative to the full Toto model) in negative log likelihood on held-out observability pretraining data when systematically disabling one component at a time. See Appendix \ref{ablations} for details.
  • Figure 4: (A) Boom consists of data from various domains represented within the Datadog platform. (B) Example series from three of the domains. From left to right, these series represent: sum of failed requests on a backend API, grouped by error type and source (Application); CPU limits on a multi-tenant service deployed on a Kubernetes cluster, grouped by tenant (Infrastructure); and sum of command executions on a Redis cache, grouped by command (Database).
  • Figure 5: Distributional comparison of 6 statistical features computed on normalized time series from the Boom, GIFT-Eval, and LSF benchmark datasets. The broader and shifted distributions in the Boom series reflect the increased diversity, irregularity, and nonstationarity characteristic of observability data.
  • ...and 7 more figures
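
The shapes and block ordering stated in the Figure 2 caption can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the released Toto code: the function names `patch_embed` and `block_layout` are hypothetical, the projection weights are random stand-ins for learned parameters, patches are assumed to be non-overlapping, and the causal normalization, the attention computation itself, and the Student-T mixture head are all omitted.

```python
import numpy as np

PATCH_SIZE = 64   # P, as given in the Figure 2 caption
EMBED_DIM = 768   # D, as given in the Figure 2 caption


def patch_embed(series, weight, bias):
    """Map an (M, L) series to (M, L / P, D) patch embeddings.

    `weight` has shape (P, D) and `bias` has shape (D,). Both stand in for
    learned parameters and are supplied by the caller (hypothetical helper).
    """
    num_variates, num_steps = series.shape
    patch_size = weight.shape[0]
    if num_steps % patch_size != 0:
        raise ValueError("L must be a multiple of the patch size P")
    # Split the time axis into non-overlapping patches: (M, L / P, P).
    patches = series.reshape(num_variates, num_steps // patch_size, patch_size)
    # Project each patch linearly into the latent space: (M, L / P, D).
    return patches @ weight + bias


def block_layout(num_segments=1, time_blocks_per_segment=11):
    """List the attention axis of each transformer block, in order.

    Each segment is N time-wise blocks followed by one variate-wise block.
    """
    segment = ["time"] * time_blocks_per_segment + ["variate"]
    return segment * num_segments


rng = np.random.default_rng(0)
series = rng.normal(size=(3, 256))  # M = 3 variates, L = 256 steps
weight = rng.normal(size=(PATCH_SIZE, EMBED_DIM)) * 0.02  # random stand-in
bias = np.zeros(EMBED_DIM)

embeddings = patch_embed(series, weight, bias)
layout = block_layout()  # F = 1 segment, N = 11 time-wise blocks

print(embeddings.shape)          # (3, 4, 768)
print(len(layout), layout[-1])   # 12 variate
```

With $M = 3$ and $L = 256$, the output has $L / P = 4$ patches per variate, matching the $M \times \frac{L}{P} \times D$ size in the caption, and the default layout gives 12 blocks in total, the last of which attends across variates.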