A system identification approach to clustering vector autoregressive time series
Zuogong Yue, Xinyi Wang, Victor Solo
TL;DR
This work addresses clustering of high-dimensional, vector-valued time series by their underlying dynamics rather than conventional similarity metrics. It develops a system-identification approach based on mixture vector autoregressions (MVAR), first proposing a soft-clustering EM algorithm (cMVAR) and then a scalable hard-clustering variant (k-LMVAR) derived from a small-noise limit, which alleviates numerical underflow and reduces computation. An Extended BIC framework is introduced to jointly select the number of clusters $K$ and the per-cluster model orders $(p_1,\dots,p_K)$, with a surrogate log-likelihood suitable for the k-LMVAR setting. Empirical results on synthetic data demonstrate that k-LMVAR achieves superior clustering accuracy and much better scalability than existing methods, including under high dimensionality, long time series, and many clusters. The proposed dynamics-based clustering has practical implications for interpretable modelling of complex systems and scalable analysis of large multivariate time-series datasets.
Abstract
Clustering of time series based on their underlying dynamics continues to attract researchers because it assists complex system modelling. Most current time series clustering methods handle only scalar time series, treat them as white noise, or rely on domain knowledge for high-quality feature construction, so the autocorrelation structure is mostly ignored. Instead of relying on heuristic feature or metric construction, the system identification approach handles vector time series clustering by explicitly modelling the underlying autoregressive dynamics. We first derive a clustering algorithm based on a mixture autoregressive model. Unfortunately, it turns out to have significant computational problems. We then derive a 'small-noise' limiting version of the algorithm, which we call k-LMVAR (Limiting Mixture Vector AutoRegression), that is computationally manageable. We develop an associated BIC for choosing the number of clusters and model order. The algorithm performs very well in comparative simulations and also scales well computationally.
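To make the idea of clustering by dynamics concrete, the following is a minimal sketch of a k-means-style hard clustering of vector time series by pooled least-squares VAR fits. It is an illustrative analogue only, not the k-LMVAR update derived in the paper: it assumes a single known order `p` shared by all clusters, an identity noise covariance (so plain squared residuals are the assignment cost), and random initial labels. The function names `lagged_design`, `simulate_var1`, and `cluster_by_var` are hypothetical.

```python
import numpy as np

def lagged_design(y, p):
    """Return regressors [y_{t-1}, ..., y_{t-p}] and targets y_t for a series y of shape (T, d)."""
    T = y.shape[0]
    X = np.hstack([y[p - k - 1 : T - k - 1] for k in range(p)])  # shape (T-p, d*p)
    Y = y[p:]                                                    # shape (T-p, d)
    return X, Y

def simulate_var1(A, T, rng, noise=0.1):
    """Simulate a VAR(1) series y_t = A y_{t-1} + e_t (toy data for the demo)."""
    d = A.shape[0]
    y = np.zeros((T, d))
    for t in range(1, T):
        y[t] = A @ y[t - 1] + noise * rng.standard_normal(d)
    return y

def cluster_by_var(series, K, p=1, n_iter=20, seed=0):
    """Alternate between pooled VAR fits per cluster and residual-based reassignment."""
    rng = np.random.default_rng(seed)
    designs = [lagged_design(y, p) for y in series]
    labels = rng.integers(K, size=len(series))  # assumption: random initial labels
    coefs = []
    for _ in range(n_iter):
        coefs = []
        for k in range(K):
            members = [i for i in range(len(series)) if labels[i] == k]
            if not members:  # re-seed an empty cluster with one random series
                members = [int(rng.integers(len(series)))]
            X = np.vstack([designs[i][0] for i in members])
            Y = np.vstack([designs[i][1] for i in members])
            B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # pooled least-squares VAR coefficients
            coefs.append(B)
        # Cost of assigning each series to each cluster: sum of squared one-step residuals.
        cost = np.array([[np.sum((Y - X @ B) ** 2) for B in coefs] for X, Y in designs])
        new_labels = cost.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs

# Toy demo: ten series from each of two VAR(1) systems with opposite-sign dynamics.
rng = np.random.default_rng(1)
A1, A2 = 0.8 * np.eye(2), -0.8 * np.eye(2)
series = [simulate_var1(A1, 200, rng) for _ in range(10)]
series += [simulate_var1(A2, 200, rng) for _ in range(10)]
labels, coefs = cluster_by_var(series, K=2, p=1)
print(labels)
```

In this toy setting the two groups differ only in their autoregressive coefficients, so the assignment is driven entirely by dynamics rather than by amplitude or shape similarity. Cluster indices are arbitrary; only the grouping is meaningful.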
