Mitigating dimensionality effects with robust graph constructions for testing
Yejiong Zhu, Hao Chen
TL;DR
This work addresses dimensionality-induced hub effects in graph-based nonparametric two-sample testing and change-point detection by introducing a robust graph construction that penalizes high node degrees. The robust $K$-NNG (r$K$-NNG) augments the standard $K$-NNG objective with a penalty on $\sum_i |G_i|^2$, where $|G_i|$ is the degree of node $i$, and is optimized via a greedy procedure. It yields substantial power gains, especially for variance/scale differences. The authors extend the asymptotic theory to directed graphs, proving consistency and deriving conditions under which the generalized edge-count test (GET) statistic on the r$K$-NNG converges to a $\chi^2_2$ distribution under the permutation null, and they give practical guidance on selecting the tuning parameter $\lambda$. In simulations and real-data analyses (brain MRI, T-cell gene expression, and NYC taxi travel patterns), the robust graph improves detection power and robustness in high-dimensional and non-Euclidean settings. The authors also point to online change-point detection as a future extension.
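To make the idea of a degree-penalized graph concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the objective is the total edge length plus $\lambda \sum_i |G_i|^2$, with $|G_i|$ taken as in-degree plus out-degree in a directed graph where every node has exactly $K$ out-neighbors, and it uses a simple swap-based local search as the greedy step. The function names (`knn_graph`, `objective`, `greedy_penalized_knn`) and the stopping rule are hypothetical choices made for illustration.

```python
import numpy as np

def knn_graph(dist, k):
    """Standard directed K-NN graph: row i lists the k nearest neighbours of i."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # a node is never its own neighbour
    return np.argsort(d, axis=1)[:, :k]

def objective(dist, nbrs, lam):
    """Assumed objective: total edge length + lam * sum of squared degrees."""
    n, k = nbrs.shape
    indeg = np.bincount(nbrs.ravel(), minlength=n)
    length = dist[np.arange(n)[:, None], nbrs].sum()
    return length + lam * np.sum((indeg + k) ** 2)  # degree = in-degree + k

def greedy_penalized_knn(dist, k, lam, max_sweeps=20):
    """Local search from the K-NN graph; every accepted swap lowers the objective."""
    n = dist.shape[0]
    nbrs = knn_graph(dist, k)
    indeg = np.bincount(nbrs.ravel(), minlength=n)
    for _ in range(max_sweeps):
        improved = False
        for i in range(n):
            for slot in range(k):
                j = nbrs[i, slot]
                # Objective change when edge i->j is replaced by i->c:
                #   d(i,c) - d(i,j) + 2*lam*(indeg[c] - indeg[j] + 1)
                # (the fixed out-degree k cancels in the difference).
                delta = dist[i] - dist[i, j] + 2.0 * lam * (indeg - indeg[j] + 1)
                delta[i] = np.inf        # no self-loops
                delta[nbrs[i]] = np.inf  # no duplicate edges
                c = int(np.argmin(delta))
                if delta[c] < -1e-12:
                    nbrs[i, slot] = c
                    indeg[j] -= 1
                    indeg[c] += 1
                    improved = True
        if not improved:
            break
    return nbrs

# Usage on synthetic high-dimensional Gaussian data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 200))
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
base = knn_graph(dist, k=5)
robust = greedy_penalized_knn(dist, k=5, lam=0.5)
print("max in-degree, K-NNG:    ", np.bincount(base.ravel(), minlength=60).max())
print("max in-degree, penalized:", np.bincount(robust.ravel(), minlength=60).max())
```

Because the K-NN graph already minimizes total edge length, any swap accepted by this search trades a little extra length for a smaller sum of squared degrees. Larger values of `lam` spread edges more evenly across nodes, and `lam = 0` leaves the K-NN graph unchanged.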
Abstract
Dimensionality effects pose major challenges in high-dimensional and non-Euclidean data analysis. Graph-based two-sample tests and change-point detection are particularly attractive in this context, as they make minimal distributional assumptions and perform well across a wide range of scenarios. These methods rely on similarity graphs constructed from data, with $K$-nearest neighbor graphs and $K$-minimum spanning trees among the most effective and widely used. However, in high-dimensional and non-Euclidean regimes such graphs often produce hubs -- nodes with disproportionately high degrees -- to which graph-based methods are especially sensitive. To mitigate these dimensionality effects, we propose a robust graph construction that is far less prone to hub formation. Incorporating this construction substantially improves the power of graph-based methods across diverse settings. We further establish a theoretical foundation by proving its consistency under fixed alternatives in both low- and high-dimensional regimes. The effectiveness of the approach is demonstrated through real-world applications, including comparisons of correlation matrices for brain regions, gene expression profiles of T cells, and temporal changes in New York City taxi travel patterns.
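As a rough illustration of what a hub looks like in practice, the sketch below computes the in-degree of every node in a directed $K$-NN graph and reports the largest one for low- and high-dimensional data. The i.i.d. Gaussian data, the sample size, and the helper name `knn_in_degrees` are assumptions chosen for illustration and are not taken from the paper's experiments.

```python
import numpy as np

def knn_in_degrees(X, k):
    """In-degree of every node in the directed K-NN graph (Euclidean distance)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                    # exclude self-neighbours
    nbrs = np.argsort(d2, axis=1)[:, :k]
    return np.bincount(nbrs.ravel(), minlength=X.shape[0])

rng = np.random.default_rng(0)
n, k = 300, 5
for dim in (2, 500):
    indeg = knn_in_degrees(rng.standard_normal((n, dim)), k)
    print(f"dim={dim:4d}  mean in-degree={indeg.mean():.1f}  max in-degree={indeg.max()}")
```

The mean in-degree always equals $K$, since every node contributes exactly $K$ outgoing edges. A maximum far above $K$ indicates that a few nodes are absorbing a disproportionate share of the edges.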
