The UCR Time Series Archive

Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Eamonn Keogh

TL;DR

The paper tackles the reliability of time series benchmarking using the UCR Archive by detailing improvements to the archive (128 datasets) and proposing rigorous evaluation guidelines. It documents baseline 1-NN and DTW evaluation practices, critiques common pitfalls like cherry-picking and single-split benchmarks, and provides concrete recommendations for fair comparisons and reproducible research. Key contributions include a critical examination of evaluation practices, a cautionary tale on misattributing gains, and a thorough update to the archive with expanded datasets (including GunPoint, GesturePebble, EthanolLevel, InternalBleeding, and Freezer collections) to support robust, real-world benchmarking. The work emphasizes transparency, standardized reporting, and statistical rigor to enhance the practical impact of time series classification research.

Abstract

The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets, but since that time it has gone through periodic expansions. The last expansion took place in the summer of 2015, when the archive grew from 45 to 85 data sets. This paper introduces and focuses on the new expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a large fraction may be misattributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a much simpler modification, requiring just a single line of code.
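
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of 1-nearest neighbor classification with Euclidean distance, evaluated on a single train/test split. It is an illustration only, not the authors' code: the function names (`euclidean`, `one_nn_predict`, `error_rate`) and the toy data are hypothetical, and the sketch assumes all series have equal length.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series (one-to-one matching)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_nn_predict(train_x, train_y, query):
    """Return the label of the training series closest to the query."""
    best = min(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    return train_y[best]


def error_rate(train_x, train_y, test_x, test_y):
    """Fraction of test series whose 1-NN prediction differs from the true label."""
    wrong = sum(
        one_nn_predict(train_x, train_y, q) != y for q, y in zip(test_x, test_y)
    )
    return wrong / len(test_x)


# Hypothetical toy data, not taken from the archive.
train_x = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
train_y = ["a", "b"]
test_x = [[0.9, 1.0, 1.0], [0.0, 0.1, 0.0]]
test_y = ["b", "b"]
print(error_rate(train_x, train_y, test_x, test_y))
```

The error rate on a held-out test set, as computed here, is the quantity that a proposed method would be compared against.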

Paper Structure

This paper contains 27 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Visualization of the warping path. top) Euclidean distance with one-to-one point matching. The warping path is strictly diagonal (cannot visit the grayed-out cells). bottom) unconstrained DTW with one-to-many point matching. The warping path can monotonically advance through any cell of the distance matrix.
  • Figure 2: blue/fine) The leave-one-out error rate for increasing values of warping window $w$, using DTW-based 1-nearest neighbor classifier. red/bold) The holdout error rate. In the bottom-row examples, the holdout accuracies do not track the predicted accuracies.
  • Figure 3: left) The error rate on classification of the CBF data set for increasing amounts of smoothing using MATLAB's default smoothing algorithm. right) The error rate on classification of the FordB data set for increasing number of nearest neighbors. Note that the leave-one-out error rate on the training data does approximately predict the best parameter to use.
  • Figure 4: Critical difference for MPdist distance against four benchmark distances. Figure credited to Gharghabi et al. (2018). We can summarize this diagram as follows: RotF is the best performing algorithm with an average rank of 2.2824; there is an overall significant difference among the five algorithms; there are three distinct cliques; MPdist is significantly better than ED distance and not significantly worse than the rest.
  • Figure 5: Comparison of Euclidean distance versus constrained DTW for 128 data sets. In the Texas Sharpshooter plot, each data set falls into one of four possibilities corresponding to four quadrants. We optimize the performance of DTW by learning a suitable warping window width and compare the expected improvement with the actual improvement. The results are strongly supportive of the claim that DTW is better than Euclidean distance for most problems. Note that some of the numbers are hard to read because they overlap.
  • ...and 7 more figures
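
The captions for Figures 1 and 2 describe two ideas that a short sketch may make more concrete: DTW with a warping window $w$, and choosing $w$ by leave-one-out error on the training data. The code below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a Sakoe-Chiba style band in which `window` is the maximum allowed offset from the diagonal, measured in cells; the function names and the toy series are hypothetical.

```python
import math


def dtw_distance(a, b, window=None):
    """DTW distance with an optional band constraint.

    window=None leaves the warping path unconstrained. For equal-length
    series, window=0 forces the strictly diagonal path, which gives the
    Euclidean distance.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be at least |n - m| wide for a path to reach the last cell.
    window = max(window, abs(n - m))
    # cost[i][j]: minimal cumulative squared cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def loo_error(series, labels, window):
    """Leave-one-out 1-NN error rate on the training data for a given window."""
    errors = 0
    for i, query in enumerate(series):
        best_dist, best_label = math.inf, None
        for j, candidate in enumerate(series):
            if i == j:
                continue  # leave the query itself out
            d = dtw_distance(query, candidate, window)
            if d < best_dist:
                best_dist, best_label = d, labels[j]
        errors += best_label != labels[i]
    return errors / len(series)


def best_window(series, labels, max_window):
    """Smallest window in 0..max_window with the lowest leave-one-out error."""
    return min(range(max_window + 1), key=lambda w: loo_error(series, labels, w))


# Hypothetical toy series: the same bump, shifted by one time step.
a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
print(dtw_distance(a, b, window=0))  # diagonal path only (Euclidean)
print(dtw_distance(a, b, window=1))  # small amount of warping allowed
print(dtw_distance(a, b))            # unconstrained
```

As the Figure 2 caption notes, the window chosen this way on the training data does not always track the holdout error, so the selected value is best read as an estimate rather than a guarantee.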