Change Point Inference for Non-Euclidean Data Sequences using Distance Profiles

Paromita Dubey; Minxing Zheng

TL;DR

This work develops a tuning-parameter-free, nonparametric change-point detector for sequences of random objects taking values in general metric spaces, based on distance profiles. The core tool is a scan statistic $\hat{T}_n(u)$ built from differences between the empirical distance profiles on either side of each candidate split point $u$; its asymptotic null distribution is derived, and inference is carried out via a permutation scheme. The authors establish consistency of the test under fixed and local alternatives and near-optimal localization rates for the change-point estimator, and extend the method to multiple change points using seeded binary segmentation. Simulations on multivariate, distributional, and network data, together with real-data applications to U.S. electricity generation and MIT Bluetooth proximity networks, indicate strong practical performance and broad applicability in non-Euclidean settings.
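To make the scan-and-permute idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names `scan_statistics` and `permutation_test`, the trimming fraction `c`, the grid size `n_grid`, the weight `k(n-k)/n`, the Riemann-sum approximation of the integral, and the inclusion of each point's zero self-distance in its own profile are all assumptions made for illustration. The only input is a pairwise distance matrix, which is what lets the same code apply to any metric space.

```python
import numpy as np

def scan_statistics(D, c=0.1, n_grid=50):
    """Illustrative scan over split points k, given an n x n distance matrix D.

    For each k, compare the empirical distance profiles of every point with
    respect to the first k points and the remaining n - k points.
    """
    n = D.shape[0]
    grid = np.linspace(0.0, D.max(), n_grid)      # radii t
    dt = grid[1] - grid[0]
    # ind[i, j, g] = 1 if d(Y_i, Y_j) <= grid[g]
    ind = (D[:, :, None] <= grid[None, None, :]).astype(float)
    csum = np.cumsum(ind, axis=1)                 # running counts over j
    total = csum[:, -1, :]
    ks = np.arange(int(np.ceil(c * n)), int(np.floor((1 - c) * n)) + 1)
    stats = np.empty(len(ks))
    for m, k in enumerate(ks):
        f_left = csum[:, k - 1, :] / k                    # profiles w.r.t. Y_1..Y_k
        f_right = (total - csum[:, k - 1, :]) / (n - k)   # profiles w.r.t. Y_{k+1}..Y_n
        sq = ((f_left - f_right) ** 2).sum(axis=1) * dt   # approx. integral over t
        stats[m] = k * (n - k) / n * sq.mean()            # assumed weighting
    return ks, stats

def permutation_test(D, n_perm=200, c=0.1, seed=0):
    """Permutation p-value for the maximum of the scan statistics."""
    rng = np.random.default_rng(seed)
    ks, stats = scan_statistics(D, c)
    observed = stats.max()
    k_hat = int(ks[stats.argmax()])               # estimated split (size of left segment)
    null = np.empty(n_perm)
    for b in range(n_perm):
        p = rng.permutation(D.shape[0])           # shuffle the time order
        null[b] = scan_statistics(D[np.ix_(p, p)], c)[1].max()
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, k_hat, p_value

# Toy sequence on the real line: the mean shifts after the 30th observation.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(4.0, 1.0, 60)])
D = np.abs(y[:, None] - y[None, :])
observed, k_hat, p_value = permutation_test(D, n_perm=99)
print(k_hat, p_value)
```

Permuting rows and columns of `D` together corresponds to shuffling the order of the observations, which is exchangeable under the null hypothesis of no change point.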

Abstract

We introduce a powerful scan statistic and the corresponding test for detecting the presence and pinpointing the location of a change point within the distribution of a data sequence with the data elements residing in a separable metric space $(\Omega, d)$. These change points mark abrupt shifts in the distribution of the data sequence as characterized using distance profiles, where the distance profile of an element $\omega \in \Omega$ is the distribution of distances from $\omega$ as dictated by the data. This approach is tuning parameter free, fully non-parametric and universally applicable to diverse data types, including distributional and network data, as long as distances between the data objects are available. We obtain an explicit characterization of the asymptotic distribution of the test statistic under the null hypothesis of no change points, rigorous guarantees on the consistency of the test in the presence of change points under fixed and local alternatives and near-optimal convergence of the estimated change point location, all under practicable settings. To compare with state-of-the-art methods we conduct simulations covering multivariate data, bivariate distributional data and sequences of graph Laplacians, and illustrate our method on real data sequences of the U.S. electricity generation compositions and Bluetooth proximity networks.
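As an illustration of the claim that only pairwise distances are needed, the following Python sketch evaluates an empirical distance profile for a few small graphs represented by their Laplacians. The Frobenius distance, the helper names (`laplacian`, `pairwise_frobenius`, `distance_profile`), and the choice to exclude the object itself from its own profile are assumptions for this example rather than details confirmed by the text above.

```python
import numpy as np

def laplacian(adj):
    """Graph Laplacian L = D - A of an undirected graph with adjacency matrix adj."""
    return np.diag(adj.sum(axis=1)) - adj

def pairwise_frobenius(mats):
    """Matrix of Frobenius distances between all pairs of equally sized matrices."""
    flat = np.stack([m.ravel() for m in mats])
    diff = flat[:, None, :] - flat[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def distance_profile(D, i, t):
    """Fraction of the other objects lying within distance t of object i."""
    others = np.delete(D[i], i)       # drop the zero self-distance
    return float(np.mean(others <= t))

# Three graphs on 3 nodes: no edges, a single edge, and a triangle.
empty = np.zeros((3, 3))
edge = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
triangle = np.ones((3, 3)) - np.eye(3)

D = pairwise_frobenius([laplacian(g) for g in (empty, edge, triangle)])
profile_at_2_5 = distance_profile(D, 0, 2.5)   # profile of the empty graph at t = 2.5
print(D.round(3), profile_at_2_5)
```

Swapping `pairwise_frobenius` for another metric (for example, a Wasserstein distance between distributions) leaves `distance_profile` unchanged, since it only reads the distance matrix.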

Paper Structure (18 sections, 3 theorems, 22 equations, 10 figures, 1 algorithm)

Introduction
Methodology
Distance profiles of random objects
Change point detection problem
Scan statistic and type I error control
Power analysis under local alternatives
Rates of convergence of the estimated change point
Simulations
Multivariate data
Bivariate distributional data
Network data
Data applications
U.S. electricity generation dataset
MIT reality mining dataset
Multiple change points
...and 3 more sections

Key Result

Theorem 1

Under $H_0$ and assumptions ass:dpfctn and ass:entropy, as $n \rightarrow \infty$, $\hat{T}_n$ converges in distribution to the law of a random variable $\mathcal{T}=\sup_{u \in \mathcal{I}_c} \sum_{j=1}^\infty \mathbb{E}_Y\{\lambda_j^Y\}\mathcal{G}_j^2(u)$, where $Y \sim P_1$, $\lambda^x_1 \geq \dots$

Figures (10)

Figure 1: The distance profiles of $Y_{200}$ in the sequence of observations $Y_i, \ i=1,\dots,300$, where $Y_i \sim N(0,1),\ i=1,\dots,100$ and $Y_i \sim N(2,1),\ i=101,\dots,300$ with respect to $Y_1,\dots,Y_{[nu]}$ and $Y_{[nu]+1},\dots,Y_{n}$ at different scan points $u=\frac{1}{6}, \frac{1}{3} \text{(the change point)}, \frac{1}{2}, \frac{3}{4}$.
Figure 2: The power panel presents power comparisons with respect to $\Delta_1$ for a sequence of $p$-dimensional random vectors sampled from $N(\mu,\Sigma)$, where $\mu=\mathbf{0}_p=(0,0,\dots,0)^T$ for $Y_i, \ i=1,\dots,100$ and $\mu=\Delta_1 \mathbf{1}_p=\Delta_1(1,1,\dots,1)^T$ for $Y_i, \ i=101,\dots,300$. $\Sigma$ is held fixed for the whole sequence as $\Sigma=U\Lambda U^T$, where $\Lambda$ is a diagonal matrix with $k$th diagonal entry $\cos(k\pi/p)+1.5$ for $k=1,\dots,p$, and $U$ is an orthogonal matrix whose first column is $p^{-1/2}(1,1,\dots,1)^T$. The dotted black line indicates the significance level of 0.05. The localization panel presents the MAE of the estimated change points with respect to $\Delta_1$.
Figure 3: The power panel presents power comparisons with respect to $\Delta_2$ for a sequence of $p$-dimensional random vectors sampled from $N(\mu,\Sigma)$, where $\mu=\mathbf{0}_p$ for the whole sequence, $\Sigma=0.8 \mathbf{I}_p$ for $Y_i, \ i=1,\dots,100$, and $\Sigma=(0.8-\Delta_2)\mathbf{I}_p$ for $Y_i, \ i=101,\dots,300$. The dotted black line indicates the significance level of 0.05. The localization panel presents the MAE of the estimated change points with respect to $\Delta_2$.
Figure 4: The power panel presents power comparisons with respect to $\Delta_3$ for a sequence of $p$-dimensional random vectors. Here, $Y_1,\dots,Y_{100}$ are generated from the standard $p$-dimensional Gaussian distribution $N(\mathbf{0}_p,\mathbf{I}_p)$, and $Y_{101},\dots,Y_{300}$ are independent samples of $AZ_1+(1-A)Z_2$, where $A \sim \textup{Bernoulli}(0.5)$, $Z_1 \sim N(-\mu,\mathbf{I}_p)$, $Z_2 \sim N(\mu,\mathbf{I}_p)$, $\mu = (\Delta_3 \mathbf{1}_{0.1p},\mathbf{0}_{0.9p})^T$, and $A$, $Z_1$, and $Z_2$ are independent. The dotted black line indicates the significance level of 0.05. The localization panel presents the MAE of the estimated change points with respect to $\Delta_3$.
Figure 5: The power panel presents power comparisons for increasing values of $v$ for a sequence of $p$-dimensional random vectors. Here, $Y_i \sim N(\mathbf{0}_p,\mathbf{I}_p)$ with $p \in \{5,15,60\}$ for $i=1,\dots,100$ and $Y_i \sim t_v$ for $i=101,\dots,300$, where $t_v$ denotes the $t$ distribution with $v$ degrees of freedom. The dotted black line indicates the significance level of 0.05. The localization panel presents the MAE of the estimated change points with respect to $v$.
...and 5 more figures
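
The setting described in the Figure 1 caption can be imitated with a short simulation. The sketch below, which is an illustration and not the code behind the figure, draws $Y_1,\dots,Y_{100} \sim N(0,1)$ and $Y_{101},\dots,Y_{300} \sim N(2,1)$, then compares the empirical distance profiles of $Y_{200}$ with respect to the two segments at the four scan points. The random seed, the radius grid, the helper name `profile`, and the use of the largest vertical gap between the two profiles as a summary are assumptions; the segment containing $Y_{200}$ also counts its zero self-distance here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Change point after observation 100, i.e. at u = 1/3.
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 200)])
ref = y[199]                          # Y_200 in 1-based indexing
grid = np.linspace(0.0, 6.0, 61)      # radii t at which profiles are evaluated

def profile(points, center, grid):
    """Empirical CDF of |Y_j - center| over the given points, evaluated on grid."""
    d = np.abs(points - center)
    return (d[None, :] <= grid[:, None]).mean(axis=1)

# Split sizes k = n * u for the scan points u = 1/6, 1/3, 1/2, 3/4.
gaps = {}
for k in (50, 100, 150, 225):
    left = profile(y[:k], ref, grid)      # profile w.r.t. Y_1, ..., Y_k
    right = profile(y[k:], ref, grid)     # profile w.r.t. Y_{k+1}, ..., Y_n
    gaps[k] = float(np.max(np.abs(left - right)))

for k, gap in gaps.items():
    print(f"u = {k / n:.3f}: largest gap between the two profiles = {gap:.3f}")
```

Plotting `left` and `right` against `grid` for each split gives a picture in the spirit of Figure 1, with the two curves expected to separate most clearly when the split sits at the true change point.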

Theorems & Definitions (3)

Theorem 1
Theorem 2
Theorem 3
