A system identification approach to clustering vector autoregressive time series

Zuogong Yue, Xinyi Wang, Victor Solo

TL;DR

This work addresses clustering of high-dimensional, vector-valued time series by their underlying dynamics rather than conventional similarity metrics. It develops a system-identification approach based on mixture vector autoregressions (MVAR), first proposing a soft-clustering EM algorithm (cMVAR) and then a scalable hard-clustering variant (k-LMVAR) derived from a small-noise limit, which alleviates numerical underflow and reduces computation. An Extended BIC framework is introduced to jointly select the number of clusters $K$ and the per-cluster model orders $(p_1,\dots,p_K)$, with a surrogate log-likelihood suitable for the k-LMVAR setting. Empirical results on synthetic data demonstrate that k-LMVAR achieves superior clustering accuracy and much better scalability than existing methods, including under high dimensionality, long time series, and many clusters. The proposed dynamics-based clustering has practical implications for interpretable modelling of complex systems and scalable analysis of large multivariate time-series datasets.

Abstract

Clustering of time series based on their underlying dynamics continues to attract researchers because it assists complex system modelling. Most current time series clustering methods handle only scalar time series, treat them as white noise, or rely on domain knowledge for high-quality feature construction, so the autocorrelation pattern is mostly ignored. Instead of relying on heuristic feature or metric construction, the system identification approach treats vector time series clustering by explicitly considering the underlying autoregressive dynamics. We first derive a clustering algorithm based on a mixture autoregressive model. Unfortunately, it turns out to have significant computational problems. We then derive a 'small-noise' limiting version of the algorithm, which we call k-LMVAR (Limiting Mixture Vector AutoRegression), that is computationally manageable. We develop an associated BIC criterion for choosing the number of clusters and the model order. The algorithm performs very well in comparative simulations and also scales well computationally.
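The abstract does not spell out the k-LMVAR update equations, so the block below is not the paper's algorithm. It is only a generic sketch of what hard-assignment clustering by autoregressive dynamics can look like, under the assumption that each series is assigned to the cluster whose VAR coefficients give the smallest one-step prediction error, after which each cluster's coefficients are refit by pooled least squares. The names `lagged`, `hard_cluster_var`, and `simulate_var1`, the alternating initial labels, and the synthetic data are all hypothetical and chosen for illustration.

```python
import numpy as np

def lagged(y, p):
    """Return regressors X (T-p, n*p) and targets Y (T-p, n) for a VAR(p) fit."""
    T = y.shape[0]
    # Row for time t stacks y[t-1], ..., y[t-p].
    X = np.hstack([y[p - k:T - k] for k in range(1, p + 1)])
    return X, y[p:]

def hard_cluster_var(series, K, p, init_labels, n_iter=20):
    """Alternate between pooled least-squares refits and residual-based reassignment."""
    data = [lagged(y, p) for y in series]
    labels = np.asarray(init_labels).copy()
    coefs = []
    for _ in range(n_iter):
        coefs = []
        for k in range(K):
            # Pool all series currently in cluster k (fall back to all data if empty).
            members = [d for d, lab in zip(data, labels) if lab == k] or data
            X = np.vstack([m[0] for m in members])
            Y = np.vstack([m[1] for m in members])
            B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # B has shape (n*p, n)
            coefs.append(B)
        # Sum of squared one-step prediction errors of every series under every cluster.
        cost = np.array([[np.sum((Y - X @ B) ** 2) for B in coefs] for X, Y in data])
        new_labels = cost.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs

def simulate_var1(A, T, rng, noise_std=0.1):
    """Simulate y[t] = A y[t-1] + e[t] with Gaussian noise (illustrative data only)."""
    y = np.zeros((T, A.shape[0]))
    for t in range(1, T):
        y[t] = A @ y[t - 1] + noise_std * rng.standard_normal(A.shape[0])
    return y

rng = np.random.default_rng(0)
A1, A2 = 0.8 * np.eye(2), -0.8 * np.eye(2)
series = [simulate_var1(A1, 300, rng) for _ in range(5)] + \
         [simulate_var1(A2, 300, rng) for _ in range(5)]
init = np.arange(len(series)) % 2  # deliberately mixed starting assignment
labels, coefs = hard_cluster_var(series, K=2, p=1, init_labels=init)
print(labels)
```

Working with prediction residuals rather than Gaussian likelihood values avoids exponentiating large negative numbers, which is one plausible reading of why a hard-clustering variant sidesteps the underflow mentioned in the TL;DR.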

Paper Structure

This paper contains 34 sections, 41 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Benchmark of the clustering performance of cMVAR and k-LMVAR against state-of-the-art methods, using the Rand Index (RI) and Normalised Mutual Information (NMI).
  • Figure 2: Benchmark of computation time for the k-LMVAR and cMVAR algorithms over different numbers of clusters. The curves boxed by dashed lines are shown zoomed in in the inset sub-figure. Error bars show one standard deviation.
  • Figure 3: Comparative study of the scalability of k-LMVAR and cMVAR. Missing data points, or points marked by red circles, indicate that cMVAR failed at that problem setup. The bar plots show the percentage of experiments in which cMVAR fails due to numerical issues; where a cMVAR curve has no point at all, the failure rate in the bar plot is 100%.
  • Figure 4: Mesh surface plot of BIC scores in terms of the number of clusters $K$ and model orders $p$.
  • Figure 5: Clustering performance (measured by NMI) of the naive two-step method, which applies k-means to the VAR model parameters estimated for each time series, for different time-series lengths.
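
The naive two-step method named in the Figure 5 caption can be sketched as follows. This is a minimal illustration and not the authors' experimental code: it assumes a least-squares VAR fit per series and a plain Lloyd's k-means with a deterministic farthest-point initialisation. The helper names (`fit_var`, `kmeans`, `simulate_var1`) and the synthetic VAR(1) data are hypothetical.

```python
import numpy as np

def fit_var(y, p):
    """Least-squares VAR(p) fit; y has shape (T, n), returns coefficients (n, n*p)."""
    T = y.shape[0]
    # Row for time t stacks y[t-1], ..., y[t-p].
    X = np.hstack([y[p - k:T - k] for k in range(1, p + 1)])
    B, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return B.T

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with farthest-point initialisation (deterministic)."""
    centres = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels

def simulate_var1(A, T, rng, noise_std=0.1):
    """Simulate y[t] = A y[t-1] + e[t] with Gaussian noise (illustrative data only)."""
    y = np.zeros((T, A.shape[0]))
    for t in range(1, T):
        y[t] = A @ y[t - 1] + noise_std * rng.standard_normal(A.shape[0])
    return y

rng = np.random.default_rng(0)
A1, A2 = 0.8 * np.eye(2), -0.8 * np.eye(2)
series = [simulate_var1(A1, 200, rng) for _ in range(5)] + \
         [simulate_var1(A2, 200, rng) for _ in range(5)]

# Step 1: estimate VAR parameters for each series separately.
features = np.array([fit_var(y, p=1).ravel() for y in series])
# Step 2: run k-means on the flattened parameter vectors.
labels = kmeans(features, k=2)
print(labels)
```

Because step 1 estimates each series in isolation, the quality of the features depends on how much data each series provides, which is the dependence on series length that the caption refers to.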