Change Point Inference for Non-Euclidean Data Sequences using Distance Profiles
Paromita Dubey, Minxing Zheng
TL;DR
This work develops a tuning-parameter-free, nonparametric change-point detector for sequences of random objects taking values in general metric spaces, based on distance profiles. The core tool is a scan statistic $\hat{T}_n(u)$ built from differences between empirical distance profiles on either side of each candidate split point; its asymptotic null distribution is derived, and inference is carried out via a permutation scheme. The authors establish consistency of the test under fixed and local alternatives, prove near-optimal localization rates for the change-point estimator, and extend the method to multiple change points using seeded binary segmentation. Simulations on multivariate, distributional, and network data, together with real-data applications to U.S. electricity generation and MIT Bluetooth networks, show that the method performs well in practice and applies broadly in non-Euclidean settings.
Abstract
We introduce a powerful scan statistic and the corresponding test for detecting the presence and pinpointing the location of a change point within the distribution of a data sequence whose elements reside in a separable metric space $(Ω, d)$. These change points mark abrupt shifts in the distribution of the data sequence as characterized using distance profiles, where the distance profile of an element $ω \in Ω$ is the distribution of distances from $ω$ as dictated by the data. This approach is tuning-parameter-free, fully nonparametric, and universally applicable to diverse data types, including distributional and network data, as long as distances between the data objects are available. We obtain an explicit characterization of the asymptotic distribution of the test statistic under the null hypothesis of no change points, rigorous guarantees on the consistency of the test in the presence of change points under fixed and local alternatives, and near-optimal convergence of the estimated change point location, all under practicable settings. To compare with state-of-the-art methods, we conduct simulations covering multivariate data, bivariate distributional data, and sequences of graph Laplacians, and we illustrate our method on real data sequences of U.S. electricity generation compositions and Bluetooth proximity networks.
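
As a rough illustration of the idea of a distance profile and of scanning over candidate split points, the following Python sketch may help. It is not the authors' statistic $\hat{T}_n(u)$: the function names, the evaluation grid, the squared-difference contrast, the $k(n-k)/n$ weighting, the inclusion of each point's zero self-distance in its own profile, and the `min_seg` boundary trimming are all assumptions made here for illustration. It only requires a pairwise distance matrix, in line with the claim that the approach applies whenever distances between data objects are available.

```python
import numpy as np


def distance_profiles(D, idx, grid):
    """Empirical distance profile of every point with respect to the subsample idx.

    Row i of the result is t -> mean_{j in idx} 1{D[i, j] <= t}, evaluated on grid.
    (Illustrative: self-distances are not excluded.)
    """
    # (n, m, 1) <= (1, 1, g) broadcasts to (n, m, g); average over the m points in idx.
    return (D[:, idx, None] <= grid[None, None, :]).mean(axis=1)


def scan_statistic(D, grid, min_seg=5):
    """Simplified scan over split points k; returns (max statistic, argmax k).

    min_seg is a hypothetical trimming parameter that keeps both segments non-trivial.
    The grid is assumed to be uniformly spaced.
    """
    n = D.shape[0]
    dt = grid[1] - grid[0]
    stats = np.full(n, -np.inf)
    for k in range(min_seg, n - min_seg + 1):
        left = np.arange(k)        # observations 0, ..., k-1
        right = np.arange(k, n)    # observations k, ..., n-1
        diff = distance_profiles(D, left, grid) - distance_profiles(D, right, grid)
        # Integrated squared profile difference, averaged over all reference points.
        contrast = np.mean(np.sum(diff ** 2, axis=1) * dt)
        stats[k] = k * (n - k) / n * contrast
    k_hat = int(np.argmax(stats))
    return float(stats[k_hat]), k_hat


def permutation_pvalue(D, grid, n_perm=99, seed=0):
    """Permutation p-value: recompute the scan statistic after shuffling the time order."""
    rng = np.random.default_rng(seed)
    observed, k_hat = scan_statistic(D, grid)
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(D.shape[0])
        stat, _ = scan_statistic(D[np.ix_(p, p)], grid)  # permute rows and columns together
        exceed += stat >= observed
    return (1 + exceed) / (1 + n_perm), k_hat


# Synthetic example: a mean shift in R^2 after the 30th observation.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)), rng.normal(3.0, 1.0, size=(30, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Euclidean distance matrix
grid = np.linspace(0.0, D.max(), 50)
p_value, k_hat = permutation_pvalue(D, grid)
print(f"estimated split index: {k_hat}, permutation p-value: {p_value:.3f}")
```

Because the functions only touch the matrix `D`, the same sketch could in principle be run on distributional or network data by swapping in, say, a Wasserstein or Frobenius distance matrix; the Euclidean example is used here only because it is easy to simulate.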
