Automated regime detection in multidimensional time series data using sliced Wasserstein k-means clustering

Qinmeng Luan; James Hamp

TL;DR

The paper tackles automatic regime detection in multidimensional time series by extending Wasserstein k-means to a sliced Wasserstein framework (sWk-means). It introduces a data-lifting scheme to convert time-series windows into empirical measures and uses fixed projection directions to compute a tractable distance and centroid update, enabling scalable clustering in 2D and 3D. Through extensive synthetic experiments and a real FX case study, it demonstrates that sWk-means can reliably identify distinct regimes even when means and covariances are similar, and it provides practical metrics to gauge clustering quality. The approach offers a computationally efficient, unsupervised regime-detection method with clear applicability to financial time series and beyond, while highlighting limitations such as the curse of dimensionality and potential alternatives like HMMs or rough-path methods.
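The data-lifting step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the window size $h_1$ is the number of consecutive returns in each window and that the lifting size $h_2$ is the offset between the starts of consecutive windows, so that windows overlap whenever $h_2 < h_1$. The function name `lift_windows` is hypothetical.

```python
import numpy as np


def lift_windows(returns, h1, h2):
    """Lift a return series into overlapping windows (empirical measures).

    Assumptions (not confirmed by the summary): h1 is the window length and
    h2 is the offset between the starts of consecutive windows.

    returns : array of shape (T,) or (T, d)
    output  : array of shape (M, h1, d); each window is treated as an
              empirical measure with h1 equally weighted atoms in R^d.
    """
    returns = np.asarray(returns, dtype=float)
    if returns.ndim == 1:
        returns = returns[:, None]  # treat a 1d series as d = 1
    n_obs = returns.shape[0]
    if not (0 < h2 <= h1 <= n_obs):
        raise ValueError("require 0 < h2 <= h1 <= number of observations")
    starts = np.arange(0, n_obs - h1 + 1, h2)
    return np.stack([returns[s:s + h1] for s in starts])


# Usage: 100 two-dimensional returns, h1 = 35 and h2 = 7 as in Figure 2.
rng = np.random.default_rng(0)
windows = lift_windows(rng.standard_normal((100, 2)), h1=35, h2=7)
```

Under this reading, a smaller $h_2$ produces more (and more heavily overlapping) windows from the same series, which is consistent with the data augmentation effect described in the caption of Figure 4.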

Abstract

Recent work has proposed Wasserstein k-means (Wk-means) clustering as a powerful method to identify regimes in time series data, and one-dimensional asset returns in particular. In this paper, we begin by studying in detail the behaviour of the Wasserstein k-means clustering algorithm applied to synthetic one-dimensional time series data. We study the dynamics of the algorithm and investigate how varying different hyperparameters impacts the performance of the clustering algorithm for different random initialisations. We compute simple metrics that we find are useful in identifying high-quality clusterings. Then, we extend the technique of Wasserstein k-means clustering to multidimensional time series data by approximating the multidimensional Wasserstein distance as a sliced Wasserstein distance, resulting in a method we call 'sliced Wasserstein k-means (sWk-means) clustering'. We apply the sWk-means clustering method to the problem of automated regime detection in multidimensional time series data, using synthetic data to demonstrate the validity of the approach. Finally, we show that the sWk-means method is effective in identifying distinct market regimes in real multidimensional financial time series, using publicly available foreign exchange spot rate data as a case study. We conclude with remarks about some limitations of our approach and potential complementary or alternative approaches.
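To make the sliced approximation concrete, the sketch below computes a sliced Wasserstein distance between two empirical measures with the same number of equally weighted atoms. It is an illustration under stated assumptions rather than the paper's code: the projection directions are drawn once from a seeded Gaussian and normalised (the paper's exact choice of fixed directions is not given in this summary), and the names `projection_directions` and `sliced_wasserstein` are hypothetical. Each one-dimensional distance uses the standard sorted-atoms formula for equal-size empirical measures.

```python
import numpy as np


def projection_directions(d, n_proj, seed=0):
    """Return n_proj unit vectors in R^d, fixed by the seed.

    Assumption: seeded random directions stand in for whatever fixed set
    of directions the paper uses.
    """
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((n_proj, d))
    return theta / np.linalg.norm(theta, axis=1, keepdims=True)


def sliced_wasserstein(x, y, directions, p=2):
    """Sliced W_p between two empirical measures of shape (N, d).

    For each direction, both point clouds are projected onto the line and
    sorted; the 1d W_p^p is the mean of |x_(i) - y_(i)|^p over atoms. The
    sliced distance averages W_p^p over directions, then takes the 1/p power.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch requires equal-size empirical measures")
    px = np.sort(x @ directions.T, axis=0)  # shape (N, n_proj)
    py = np.sort(y @ directions.T, axis=0)
    return float(np.mean(np.abs(px - py) ** p) ** (1.0 / p))


# Usage: distance between two windows of 35 three-dimensional returns.
rng = np.random.default_rng(1)
dirs = projection_directions(d=3, n_proj=16)
dist = sliced_wasserstein(rng.standard_normal((35, 3)),
                          rng.standard_normal((35, 3)), dirs)
```

Because each projection reduces the problem to sorting, the cost per pair of measures scales as the number of directions times $N \log N$, which is what makes the distance tractable inside a k-means loop.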

Paper Structure (26 sections, 27 equations, 11 figures, 3 tables, 2 algorithms)

Introduction
Methodology
Data streams and empirical distributions
Wasserstein metric
Wasserstein distance $\mathcal{W}_p$
Wasserstein barycentre $\bar{\mu}^{\mathcal{W}_p}$
$d=1$
$d>1$
Sliced Wasserstein distance $\overline{\mathcal{W}}_p$
Sliced Wasserstein barycentre $\bar{\mu}^{\overline{\mathcal{W}}_p}$
sWk-means method
Results
1d time series data: Dynamics and performance of the Wk-means algorithm
1d synthetic data generation method
Clustering example
...and 11 more sections

Figures (11)

Figure 1: Synthetic 1d data containing two regimes. (a) The time series $S(t)$, with the majority regimes (I) corresponding to 'bullish' parameters $\Theta_\mathrm{bull}$ and minority regimes (II) corresponding to 'bearish' parameters $\Theta_\mathrm{bear}$ indicated. (b) The corresponding log returns $r^S$. There are ${20 \times 252 \times 7 = 35,280}$ data points.
Figure 2: Results of the Wk-means clustering algorithm applied to the synthetic 1d data shown in Figure 1. (a) Clustering results for the time series $S(t)$. Each point in the time series is coloured according to its assigned cluster. (b) Clustering results for the distributions $\mu_m \in \mathcal{K}$ in mean-variance ($\text{Var}(\mu_m)$-$\mathbb{E}(\mu_m)$) space. Each point is coloured according to its assigned cluster. The window size is $h_1 = 35$ and the lifting size is $h_2 = 7 \, (20\%)$.
Figure 3: Dynamics of the Wk-means clustering algorithm applied to the synthetic 1d data shown in Figure 1. (a) Mean squared point-centroid distance $\langle \mathcal{W}_p(\mu_i, \bar{\mu}_k)^2 \rangle_{k,i\in\mathcal{C}_k}$ and (b) Mean centroid-centroid distance $\langle \mathcal{W}_p(\bar{\mu}_k, \bar{\mu}_{k'}) \rangle_{k,k'}$ as a function of algorithm iteration for different random initialisations. The paths are coloured according to the instantaneous total accuracy $\mathrm{TA}(\mathcal{C})$ computed during the evolution of the algorithm (see colourbar). The two metrics are effective in differentiating between high- and low-accuracy clusterings. The window size is $h_1 = 30$ and the lifting size is $h_2 = 9 \, (30\%)$.
Figure 4: Dependence of the average accuracy score $\overline{\mathrm{TA}}$ (median) computed from $N_c = 1,000$ clustering runs on the window and lifting size ($h_1, h_2$), using different amounts of synthetic 1d data. The average accuracy $\overline{\mathrm{TA}}$ generally increases with decreasing $h_2$ due to the data augmentation effect associated with decreasing $h_2$. This effect is particularly pronounced for smaller datasets (2 years, 1 year). The average accuracy also generally increases with $h_1$.
Figure 5: Synthetic 2d time series data with two regimes. (a), (c) The time series $S(t)$, with majority (I) and minority (II) regimes indicated. (b), (d) The empirical distributions of log returns $r^S$ corresponding to (a), (c) respectively. There are $20 \times 252 \times 7 = 35,280$ data points. The data in (a), (b) has regime I corresponding to 'bullish' parameters $\Theta_\mathrm{bull}$, regime II corresponding to 'bearish' parameters $\Theta_\mathrm{bear}$, and $\rho = +1/2$ for both regimes. The data in (c), (d) has regime I and II both corresponding to 'bullish' parameters $\Theta_\mathrm{bull}$, but regime I having $\rho = +1/2$ and regime II having $\rho = -1/2$. The light-coloured points in the distributions in (b), (d) correspond to the majority regime (I) periods with no highlighting in (a), (c); the orange points correspond to the minority regime (II) periods highlighted in orange.
...and 6 more figures
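
The two clustering-quality metrics named in the caption of Figure 3 can be sketched for the 1d case as follows. This is an illustrative reading, not the authors' code. It assumes that every window and every centroid is an empirical measure with the same number of equally weighted atoms, that the point-centroid average runs over all windows paired with their assigned centroid, and that the centroid-centroid average runs over distinct pairs $k < k'$. All function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def wasserstein_1d(x, y, p=2):
    """W_p between two 1d empirical measures with the same number of atoms."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch requires equal-size empirical measures")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


def mean_sq_point_centroid(windows, centroids, labels, p=2):
    """Average of W_p(mu_i, centroid_k)^2 over windows i assigned to cluster k.

    Assumption: the average is taken over all windows, not per cluster first.
    """
    sq = [wasserstein_1d(w, centroids[k], p) ** 2
          for w, k in zip(windows, labels)]
    return float(np.mean(sq))


def mean_centroid_centroid(centroids, p=2):
    """Average W_p between distinct pairs of centroids (assumes k < k')."""
    pairs = combinations(range(len(centroids)), 2)
    return float(np.mean([wasserstein_1d(centroids[a], centroids[b], p)
                          for a, b in pairs]))


# Usage: three tiny windows, two centroids, and a cluster label per window.
windows = np.array([[0.0, 0.0], [2.0, 2.0], [5.0, 7.0]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])
labels = [0, 0, 1]
within = mean_sq_point_centroid(windows, centroids, labels)
between = mean_centroid_centroid(centroids)
```

A clustering with a small point-centroid value and a large centroid-centroid value has tight, well-separated clusters, which is the sense in which the caption describes these metrics as separating high- from low-accuracy runs.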
