STEB: In Search of the Best Evaluation Approach for Synthetic Time Series

Michael Stenger, Robert Leppich, André Bauer, Samuel Kounev

TL;DR

STEB tackles the lack of standardized evaluation for synthetic time series by introducing a benchmark that stress-tests 41 quantitative TS evaluation measures across 10 diverse datasets using 13 transformations along a modulation path controlled by $\kappa$. It introduces reliability ($r_\text{rel}$) and consistency ($r_\text{con}$) indicators to quantify how well measures track true quality and remain stable across seeds and datasets, while recording running times and embedding dependencies. The study ranks measures, reveals that the upstream TS embedding choice can substantially alter scores, and demonstrates the need for embedder standardization in measure evaluation. With plans to open-source STEB, the work lays groundwork for more objective, scalable, and interpretable comparisons of synthetic TS quality across domains and applications.

Abstract

The growing need for synthetic time series, due to data augmentation or privacy regulations, has led to numerous generative models, frameworks, and evaluation measures alike. Objectively comparing these measures on a large scale remains an open challenge. We propose the Synthetic Time series Evaluation Benchmark (STEB) -- the first benchmark framework that enables comprehensive and interpretable automated comparisons of synthetic time series evaluation measures. Using 10 diverse datasets, randomness injection, and 13 configurable data transformations, STEB computes indicators for measure reliability and score consistency. It also tracks running time and test errors, and it supports sequential and parallel modes of operation. In our experiments, we determine a ranking of 41 measures from the literature and confirm that the choice of upstream time series embedding heavily impacts the final score.

Paper Structure

This paper contains 80 sections, 23 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Depiction of the modulation concept. By modulating parameter $\kappa$, we can influence the degree to which a transformation $T$ impacts dataset $D_r$ (resp. its underlying distribution $P$) to create the pseudo-synthetic dataset $D_T$ with distribution $P_T$. For $\kappa = 0$, we get $P_T^0 = P$ (black), for $\kappa = 0.3$, it might be $P_T^{1}$ (blue, dashed), and for $\kappa = 0.9$, it is $P_T^{2}$ (red, dotted).
  • Figure 2: Architectural design of STEB. Input (top left) and output (bottom) are highlighted in orange, STEB components in blue. Datasets are referenced as $D$ and the flow of data is indicated by arrows with filled head. The open-headed arrows mark information flow such as scores, rankings, measurements, and error messages. Dashed arrows denote conditional data flow, where $D_\text{held\_out}$ depends on the measure and $D_\text{rs}$ on the transformation.
  • Figure 3: Critical difference diagram for reliability indicator $r_\text{rel}$ in category fidelity as part of Main. The horizontal axis at the top depicts $r_\text{rel}$. Additional horizontal bars connect groups of measures with no significantly different $r_\text{rel}$ value.
  • Figure 4: Critical difference diagram for reliability indicator $r_\text{rel}$ in category generalization as part of Main. The horizontal axis at the top depicts $r_\text{rel}$. Additional horizontal bars connect groups of measures with no significantly different $r_\text{rel}$ value.
  • Figure 5: Critical difference diagram for reliability indicator $r_\text{rel}$ in category privacy as part of Main. The horizontal axis at the top depicts $r_\text{rel}$. Additional horizontal bars connect groups of measures with no significantly different $r_\text{rel}$ value.
  • ...and 1 more figure
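
The modulation concept in the Figure 1 caption can be illustrated with a small sketch. It is not taken from STEB: the additive-noise transformation, the sine-wave dataset, the toy distance score, and all function and parameter names (`modulated_noise`, `max_sigma`, `toy_distance`) are assumptions chosen for illustration. The only property borrowed from the caption is that $\kappa = 0$ leaves the real dataset $D_r$ unchanged, while larger $\kappa$ moves the pseudo-synthetic dataset $D_T$ further away from it.

```python
import math
import random
import statistics


def make_real_dataset(n_series=20, length=50, seed=1):
    """Toy stand-in for D_r: sine waves with random phases (an assumption)."""
    rng = random.Random(seed)
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_series)]
    return [
        [math.sin(2.0 * math.pi * t / length + phase) for t in range(length)]
        for phase in phases
    ]


def modulated_noise(dataset, kappa, seed=0, max_sigma=1.0):
    """Hypothetical transformation T: additive Gaussian noise scaled by kappa.

    kappa = 0 returns an unchanged copy of the data; kappa = 1 applies the
    full noise level max_sigma. STEB's actual transformations may differ.
    """
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1]")
    sigma = kappa * max_sigma
    transformed = []
    for i, series in enumerate(dataset):
        # Re-seeding per series keeps the noise draws identical across
        # kappa values, so only the noise scale changes along the path.
        rng = random.Random(seed + i)
        transformed.append([x + rng.gauss(0.0, sigma) for x in series])
    return transformed


def toy_distance(real, synthetic):
    """Toy score: mean absolute pointwise deviation (not a STEB measure)."""
    diffs = [
        abs(a - b)
        for r_series, s_series in zip(real, synthetic)
        for a, b in zip(r_series, s_series)
    ]
    return statistics.fmean(diffs)


# Walk along a short modulation path using the kappa values from Figure 1.
real = make_real_dataset()
kappas = [0.0, 0.3, 0.9]
scores = [toy_distance(real, modulated_noise(real, k)) for k in kappas]

for k, s in zip(kappas, scores):
    print(f"kappa={k:.1f}  toy distance={s:.4f}")
```

In this toy setting the score is zero at $\kappa = 0$ and grows with $\kappa$. A measure whose scores did not follow the modulation path in this way would be a candidate for a low reliability rating, although the exact definition of $r_\text{rel}$ is not reproduced here.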