Toto: Time Series Optimized Transformer for Observability

Ben Cohen; Emaad Khwaja; Kan Wang; Charles Masson; Elise Ramé; Youssef Doubli; Othmane Abou-Amal

TL;DR

Toto introduces a decoder-only transformer tailored for observability time series, featuring a novel Proportional Factorized Space-Time Attention and a robust $k$-component Student-T mixture head for probabilistic forecasting. Pretrained on approximately $10^{12}$ time-series points, including $75\%$ anonymous Datadog metrics, Toto excels in zero-shot forecasting on standard LSF benchmarks and on a Datadog observability benchmark, while remaining competitive in full-shot settings. The model is designed to handle high-frequency, high-cardinality, nonstationary telemetry data, enabling accurate, scalable forecasts for real-time monitoring and incident response. These advances demonstrate the practical impact of large-scale, domain-specific foundation models for observability analytics, potentially reducing alert fatigue and speeding anomaly detection across complex systems.

Abstract

This technical report describes the Time Series Optimized Transformer for Observability (Toto), a new state-of-the-art foundation model for time series forecasting developed by Datadog. In addition to advancing the state of the art on generalized time series benchmarks in domains such as electricity and weather, this model is the first general-purpose time series forecasting foundation model to be specifically tuned for observability metrics. Toto was trained on a dataset of one trillion time series data points, the largest among all currently published time series foundation models. Alongside publicly available time series datasets, 75% of the data used to train Toto consists of fully anonymous numerical metric data points from the Datadog platform. In our experiments, Toto outperforms existing time series foundation models on observability data. It does this while also excelling at general-purpose forecasting tasks, achieving state-of-the-art zero-shot performance on multiple open benchmark datasets.

Paper Structure

This paper contains 27 sections, 3 figures, and 5 tables.

Background
Observability data
Traditional models
Foundation models
Recent work
Problem statement
Model architecture
Transformer design
Input embedding
Attention mechanism
Probabilistic prediction head
Input/output scaling
Training objective
Hyperparameters
Training data
...and 12 more sections

Figures (3)

Figure 2: Example of Toto's 96-step zero-shot forecasts on the ETTh1 dataset, showing multivariate probabilistic predictions. Solid lines represent ground truth, dashed lines represent median point forecasts, and shaded regions represent 95% prediction intervals.
Figure 3: The patch embedding takes as input a multivariate time series of $M$ variates by $N$ time steps. It divides each variate along the time dimension into patches of size $P$ and projects these linearly into an embedding space of latent dimension $D$. This results in an output of size $M \times \frac{N}{P} \times D$ which is fed to the transformer decoder.
Figure 4: Example metric query in the Datadog platform. The metric name (1) determines which metric is being queried. The filter clause (2) limits which contexts are queried, in this case restricting the query to the prod environment. The space aggregation (3) indicates that the average metric value should be returned for each unique combination of the group-by keys. The time aggregation (4) indicates that metric values should be aggregated to the average for each 60-second interval. The query results will be a multivariate time series with 1-minute time steps and a separate variate for each unique (service, datacenter) tuple.
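
The patch embedding described in the Figure 3 caption can be illustrated with a minimal sketch. It is written in plain Python for readability and is not Toto's actual implementation. The function name `patch_embed`, the randomly initialized `weight` and `bias`, and the requirement that $N$ be divisible by $P$ are assumptions made for this example only. The sketch shows only the shape transformation from $M \times N$ to $M \times \frac{N}{P} \times D$.

```python
import random


def patch_embed(series, patch_size, weight, bias):
    """Split each variate into non-overlapping patches and project them linearly.

    series: M variates, each a list of N time steps.
    weight: patch_size x D projection matrix (hypothetical parameters).
    bias:   list of length D.
    Returns a nested list of shape M x (N // patch_size) x D.
    """
    latent_dim = len(bias)
    embedded = []
    for variate in series:
        if len(variate) % patch_size != 0:
            # Assumption of this sketch: N is an exact multiple of P.
            raise ValueError("series length must be divisible by patch_size")
        tokens = []
        for start in range(0, len(variate), patch_size):
            patch = variate[start:start + patch_size]
            # Linear projection of one patch of P values into D dimensions.
            token = [
                sum(patch[p] * weight[p][d] for p in range(patch_size)) + bias[d]
                for d in range(latent_dim)
            ]
            tokens.append(token)
        embedded.append(tokens)
    return embedded


# Illustrative sizes only; these are not the model's hyperparameters.
M, N, P, D = 2, 12, 4, 8
rng = random.Random(0)
series = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(M)]
weight = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(P)]
bias = [0.0] * D

embedded = patch_embed(series, P, weight, bias)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 3, 8), i.e. M x N/P x D
```

Each variate is patched independently, so the variate axis is preserved and can later be attended over separately from the time axis. A practical implementation would use a tensor library and learned parameters rather than nested lists.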
