Mitigating dimensionality effects with robust graph constructions for testing

Yejiong Zhu; Hao Chen

TL;DR

This work addresses dimensionality-induced hub effects in graph-based nonparametric two-sample testing and change-point detection by introducing a robust graph construction that penalizes high node degrees. The robust $K$-NNG (r$K$-NNG) augments the standard objective with a penalty on $\sum_i |G_i|^2$ and is optimized via a greedy procedure, yielding substantial power gains, especially for variance/scale differences. The authors extend the asymptotic theory to directed graphs, proving consistency and deriving conditions under which the GET statistic on the r$K$-NNG converges to a $\chi^2_2$ distribution under the permutation null, with practical guidance on selecting the tuning parameter $\lambda$. Through simulations and real-data analyses (brain MRI, T-cell gene expression, and NYC taxi travels), the robust graph demonstrably improves detection sensitivity and robustness across high-dimensional and non-Euclidean settings, while offering a principled path for parameter tuning and future online-change-point extensions.
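The degree-penalized construction described above can be sketched in a few lines. This is only one possible reading, not the authors' procedure: it assumes $|G_i|$ counts the edges touching node $i$, that each node chooses exactly $K$ out-neighbors, and that nodes are processed once in index order. The function name `robust_knn_graph`, the parameter `lam` (standing in for $\lambda$), and the use of Euclidean distance are hypothetical choices made for illustration. Because every node ends with out-degree $K$ in this sketch, only the in-degrees vary, so the code penalizes squared in-degree.

```python
import numpy as np

def robust_knn_graph(X, k, lam):
    """Greedy sketch of a degree-penalized directed K-NN graph.

    Hypothetical helper for illustration; not the authors' algorithm.
    Returns (neighbors, in_deg): neighbors[i] holds the k nodes chosen by i,
    in_deg[j] counts how many nodes chose j.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")

    # Pairwise Euclidean distances; inf on the diagonal forbids self-loops.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    in_deg = np.zeros(n, dtype=int)
    neighbors = np.zeros((n, k), dtype=int)
    for i in range(n):
        chosen = np.zeros(n, dtype=bool)
        for slot in range(k):
            # Adding an edge into j raises in_deg[j]**2 by 2*in_deg[j] + 1,
            # so the marginal cost is distance plus lam times that increase.
            cost = dist[i] + lam * (2 * in_deg + 1)
            cost[chosen] = np.inf  # each neighbor is picked at most once
            j = int(np.argmin(cost))
            neighbors[i, slot] = j
            chosen[j] = True
            in_deg[j] += 1
    return neighbors, in_deg

# Usage: compare the largest in-degree with and without the penalty
# on high-dimensional Gaussian data (lam=0 is the ordinary K-NN graph).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))
_, deg_plain = robust_knn_graph(X, k=5, lam=0.0)
_, deg_robust = robust_knn_graph(X, k=5, lam=1.0)
print("max in-degree, plain:", deg_plain.max(), "penalized:", deg_robust.max())
```

With `lam=0` the sketch reduces to the ordinary $K$-NNG; larger values trade slightly longer edges for a flatter degree distribution, which is the hub-mitigation effect the summary attributes to the r$K$-NNG.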

Abstract

Dimensionality effects pose major challenges in high-dimensional and non-Euclidean data analysis. Graph-based two-sample tests and change-point detection are particularly attractive in this context, as they make minimal distributional assumptions and perform well across a wide range of scenarios. These methods rely on similarity graphs constructed from data, with $K$-nearest neighbor graphs and $K$-minimum spanning trees among the most effective and widely used. However, in high-dimensional and non-Euclidean regimes such graphs often produce hubs -- nodes with disproportionately high degrees -- to which graph-based methods are especially sensitive. To mitigate these dimensionality effects, we propose a robust graph construction that is far less prone to hub formation. Incorporating this construction substantially improves the power of graph-based methods across diverse settings. We further establish a theoretical foundation by proving its consistency under fixed alternatives in both low- and high-dimensional regimes. The effectiveness of the approach is demonstrated through real-world applications, including comparisons of correlation matrices for brain regions, gene expression profiles of T cells, and temporal changes in New York City taxi travel patterns.

Paper Structure (16 sections, 4 theorems, 8 equations, 19 figures, 5 tables, 1 algorithm)

Introduction
Dimensionality effects on GET with the $K$-NNG
Relationship between hubs and dimensionality
A robust graph construction to mitigate hubness
Performance of GET on the robust $K$-NNG (r$K$-NNG)
Choice of $\lambda$
Asymptotic properties of the GET statistic on the r$K$-NNG
Numerical studies
Two-sample testing
Change-point detection
Real-data examples
Correlation matrix of brain regions
Gene expression in T cells subtypes
New York City taxi travel pattern
Conclusion and discussion
...and 1 more section

Key Result

Lemma 4

For a directed graph $G$ with $|G| = O(N^\alpha)$, $1 \leq \alpha < 2$, and under the paper's conditions in the usual limit regime, the GET statistic $S$ satisfies $S \xrightarrow{\mathcal{D}} \chi_2^2$ under the permutation null distribution.
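
As a worked consequence of this limit (a standard property of the $\chi^2_2$ distribution, not a formula quoted from the paper), the approximate p-value for an observed value $s$ of $S$ has a closed form:

$$p \approx P(\chi^2_2 > s) = e^{-s/2},$$

so a level-$0.05$ test rejects when $s > -2\ln(0.05) \approx 5.99$.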

Figures (19)

Figure 1: Estimated power for different two-sample tests.
Figure 2: Estimated power of GET on the $5$-NNG and the $14$-NNG under no perturbation (red), random perturbation (green), and outlier perturbation (orange).
Figure 3: Estimated power of GET on the $5$-NNG and the $14$-NNG under no perturbation (red), outlier perturbation (orange), and hub perturbation (blue).
Figure 4: Average degree of perturbed observations in the $K$-NNG under the settings used in Figures 2 and 3, with $\sigma^2 = 1.02$.
Figure 5: Empirical degree distributions of the $5$-NNG for data drawn from the standard multivariate normal distribution (top panel) and the earlier two-distribution example with $\sigma^2=1.02$ (bottom panel).
...and 14 more figures

Theorems & Definitions (8)

Remark 1
Remark 2
Remark 3
Lemma 4
Theorem 5
Remark 6
Theorem 7: Consistency under fixed dimensions
Theorem 8: Consistency under high dimensions
