Bake off redux: a review and experimental evaluation of recent time series classification algorithms

Matthew Middlehurst, Patrick Schäfer, Anthony Bagnall

TL;DR

Bake off redux reviews six years of progress in time series classification (TSC). It evaluates algorithms proposed since the 2017 bake off on the 112 univariate TSC (UTSC) datasets of the UCR archive plus 30 new UTSC datasets, and extends the original five-category taxonomy with convolution based, feature based and deep learning categories. The study finds that MR-Hydra (Hydra+MultiROCKET) and HC2 (HIVE-COTEv2) perform significantly better than other approaches, with QUANT offering exceptional speed and strong performance on large-scale problems. Deep learning lags behind optimized ensembles for UTSC, partly due to data scale and reproducibility issues, while dilation and ensemble diversity remain key drivers of performance. Overall, the work provides a comprehensive, reproducible benchmark and practical guidance on method selection, highlighting the ongoing value of hybrid and convolution based approaches for time series classification.

Abstract

In 2017, a research paper compared 18 Time Series Classification (TSC) algorithms on 85 datasets from the University of California, Riverside (UCR) archive. This study, commonly referred to as a 'bake off', identified that only nine algorithms performed significantly better than the Dynamic Time Warping (DTW) and Rotation Forest benchmarks that were used. The study categorised each algorithm by the type of feature it extracts from time series data, forming a taxonomy of five main algorithm types. This categorisation of algorithms, alongside the provision of code and accessible results for reproducibility, has helped fuel an increase in popularity of the TSC field. More than six years have passed since this bake off: the UCR archive has expanded to 112 datasets and a large number of new algorithms have been proposed. We revisit the bake off, seeing how each of the proposed categories has advanced since the original publication, and evaluate the performance of newer algorithms against the previous best-of-category using an expanded UCR archive. We extend the taxonomy to include three new categories to reflect recent developments. Alongside the originally proposed distance, interval, shapelet, dictionary and hybrid based algorithms, we compare newer convolution and feature based algorithms as well as deep learning approaches. We introduce 30 classification datasets either recently donated to the archive or reformatted to the TSC format, and use these to further evaluate the best performing algorithm from each category. Overall, we find that two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better than other approaches on both the current and new TSC problems.

Paper Structure (78 sections, 5 equations, 52 figures, 23 tables)


Figures (52)

  • Figure 1: The convolution operation as a sliding dot-product. The kernel $\omega=[-1,0,1]$ is convolved with the input series, producing an activation map. Max-pooling extracts the maximum from this activation map.
  • Figure 2: The $30$ new univariate datasets showing one representative series for each class.
  • Figure 3: The sorted original label values for all discretised regression datasets. Each point is a label for a case, and its colour is the class it is part of for the new classification version.
  • Figure 4: Comparison of distribution of the $30$ newly acquired to the existing $112$ UCR UTSC datasets across dimensions including length, train set size, number of classes, and data type.
  • Figure 5: An example of how DTW compensates for phase shift by realigning two series (in red at the bottom and in green at the top).
  • ...and 47 more figures
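The realignment described in the Figure 5 caption can be illustrated with a minimal dynamic-programming sketch of DTW. This is not the paper's implementation: the squared pointwise cost, the absence of a warping window, and the names `dtw_distance` and `squared_euclidean` are assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """DTW distance between two series (assumed: squared cost, no warping window)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def squared_euclidean(a, b):
    """Point-by-point squared distance with no realignment, for comparison."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# The same bump, shifted by one time step (a phase shift).
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
shifted_dtw = dtw_distance(a, b)
shifted_euclidean = squared_euclidean(a, b)
```

On this toy pair the lock-step distance penalises the shift at every point of the bump, whereas the warping path can absorb it by repeating the flat points at either end.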

Theorems & Definitions (7)

  • Definition 1: Time Series (TS)
  • Definition 2: Multivariate Time Series (MTS)
  • Definition 3: Dataset
  • Definition 4: Subseries
  • Definition 5: Sliding Window
  • Definition 6: Convolution (cross-correlation)
  • Definition 7: Dilated Subseries
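As a rough illustration of the convolution (cross-correlation) operation described in the Figure 1 caption, the sketch below slides the kernel $\omega=[-1,0,1]$ over a series and max-pools the activation map. The input series, the function name `sliding_dot_product` and the `dilation` parameter are assumptions for illustration; the dilation handling follows the common convention of spacing kernel weights `dilation` points apart, which may differ in detail from the paper's Definition 7.

```python
def sliding_dot_product(series, kernel, dilation=1):
    """Cross-correlate kernel with series (no padding) and return the activation map."""
    # Number of input points spanned by one placement of the (dilated) kernel.
    span = (len(kernel) - 1) * dilation + 1
    n_out = len(series) - span + 1
    if n_out <= 0:
        raise ValueError("series is too short for this kernel and dilation")
    return [
        sum(w * series[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(n_out)
    ]


series = [0, 1, 3, 2, 5, 4]   # hypothetical input series
kernel = [-1, 0, 1]           # the kernel from the Figure 1 caption

activation = sliding_dot_product(series, kernel)
max_pooled = max(activation)  # max-pooling keeps the largest activation

# With dilation 2 the same three weights cover five input points.
dilated_activation = sliding_dot_product(series, kernel, dilation=2)
```

With this kernel each activation is simply the difference between the points on either side of the kernel centre, so max-pooling reports the steepest rise seen at that spacing.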