Quantitative Evaluation of Motif Sets in Time Series
Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel
TL;DR
This work addresses the lack of broadly applicable quantitative evaluation in time series motif discovery (TSMD) with two contributions: PROM, a precision-recall metric under optimal matching, and TSMD-Bench, a benchmark whose ground-truth motif sets are derived from real data. PROM first matches discovered motifs to ground-truth motifs using overlap-based criteria, then aligns discovered motif sets with ground-truth motif sets using the Hungarian method, and finally reports micro-averaged precision, recall, and F1. It makes no restrictive assumptions about motif length or the number of motif sets. TSMD-Bench constructs 14 benchmark datasets from time series classification archives, using a principled concatenation scheme and ARI-guided dataset selection to yield realistic TSMD tasks; it also provides validation/test splits to support hyperparameter tuning. Experiments show that PROM offers a more balanced evaluation than existing metrics, that LoCoMotif often achieves the best F1, and that random-walk-based benchmarks are too easy, which highlights the value of the proposed benchmark for fair, large-scale TSMD comparisons.
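The three PROM steps summarized above (motif-level matching, set-level alignment, micro-averaging) can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: motifs are half-open index intervals `(start, end)`, the overlap criterion is intersection-over-union above a hypothetical `threshold` of 0.5, motif-level matching is a greedy one-to-one pass, and every discovered motif that is not a true positive counts as a false positive. To stay dependency-free, the set-level alignment is found by exhaustive search over assignments; it reaches the same optimum the Hungarian method would, but only scales to a handful of motif sets.

```python
from itertools import permutations


def overlap_rate(a, b):
    """Intersection over union of two half-open intervals (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def contingency(gt_sets, found_sets, threshold=0.5):
    """counts[i][j] = motifs of found set j matched one-to-one to GT set i.

    Assumption: pairs are matched greedily in order of decreasing overlap.
    """
    candidates = []
    for i, gt_set in enumerate(gt_sets):
        for a, g in enumerate(gt_set):
            for j, found_set in enumerate(found_sets):
                for b, f in enumerate(found_set):
                    rate = overlap_rate(g, f)
                    if rate > threshold:
                        candidates.append((rate, i, a, j, b))
    candidates.sort(reverse=True)

    counts = [[0] * len(found_sets) for _ in gt_sets]
    used_gt, used_found = set(), set()
    for _, i, a, j, b in candidates:
        if (i, a) not in used_gt and (j, b) not in used_found:
            used_gt.add((i, a))
            used_found.add((j, b))
            counts[i][j] += 1
    return counts


def prom_sketch(gt_sets, found_sets, threshold=0.5):
    """Return (precision, recall, f1) under the best set-to-set assignment."""
    counts = contingency(gt_sets, found_sets, threshold)
    n_gt, n_found = len(gt_sets), len(found_sets)

    # Exhaustive stand-in for the Hungarian method: try every injective
    # assignment between the smaller and the larger collection of sets.
    if n_gt <= n_found:
        tp = max(sum(counts[i][p[i]] for i in range(n_gt))
                 for p in permutations(range(n_found), n_gt))
    else:
        tp = max(sum(counts[p[j]][j] for j in range(n_found))
                 for p in permutations(range(n_gt), n_found))

    # Micro-averaging: pool the counts over all motif sets before dividing.
    total_found = sum(len(s) for s in found_sets)
    total_gt = sum(len(s) for s in gt_sets)
    precision = tp / total_found if total_found else 0.0
    recall = tp / total_gt if total_gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: two ground-truth motif sets, three discovered motif sets.
gt_sets = [[(0, 10), (50, 60), (100, 110)], [(20, 35), (70, 85)]]
found_sets = [[(21, 34), (71, 86)], [(1, 11), (49, 59)], [(200, 210)]]
precision, recall, f1 = prom_sketch(gt_sets, found_sets)
```

In this toy example, four of the five discovered motifs and four of the five ground-truth motifs end up matched under the best alignment, so precision and recall are both 0.8. For larger numbers of motif sets, a polynomial-time assignment solver would replace the permutation search.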
Abstract
Time Series Motif Discovery (TSMD), which aims at finding recurring patterns in time series, is an important task in numerous application domains, and many methods for this task exist. These methods are usually evaluated qualitatively. A few metrics for quantitative evaluation, where discovered motifs are compared to some ground truth, have been proposed, but they typically make implicit assumptions that limit their applicability. This paper introduces PROM, a broadly applicable metric that overcomes those limitations, and TSMD-Bench, a benchmark for quantitative evaluation of time series motif discovery. Experiments with PROM and TSMD-Bench show that PROM provides a more comprehensive evaluation than existing metrics, that TSMD-Bench is a more challenging benchmark than earlier ones, and that the combination can help understand the relative performance of TSMD methods. More generally, the proposed approach enables large-scale, systematic performance comparisons in this field.
