Mitigating dimensionality effects with robust graph constructions for testing

Yejiong Zhu, Hao Chen

TL;DR

This work addresses dimensionality-induced hub effects in graph-based nonparametric two-sample testing and change-point detection by introducing a robust graph construction that penalizes high node degrees. The robust $K$-nearest neighbor graph (r$K$-NNG) augments the standard $K$-NNG objective with a penalty on $\sum_i |G_i|^2$ and is optimized via a greedy procedure, yielding substantial power gains, especially for variance/scale differences. The authors extend the asymptotic theory to directed graphs, proving consistency and deriving conditions under which the generalized edge-count test (GET) statistic on the r$K$-NNG converges to a $\chi^2_2$ distribution under the permutation null, with practical guidance on selecting the tuning parameter $\lambda$. In simulations and real-data analyses (brain MRI, T-cell gene expression, and NYC taxi travel patterns), the robust graph improves detection sensitivity and robustness across high-dimensional and non-Euclidean settings, while offering a principled path for parameter tuning and future extensions to online change-point detection.
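
The degree penalty described above can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the objective is the total edge length plus $\lambda \sum_i d_i^2$, where $d_i$ is the in-degree of node $i$ (every node has out-degree $K$, so penalizing in-degrees differs from penalizing total degrees only by a constant), and it uses a simple node-by-node greedy update. The function names `robust_knn_graph` and `penalized_objective` and the value of `lam` are hypothetical.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix with +inf on the diagonal (no self-loops)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)
    return D

def knn_graph(D, K):
    """Standard directed K-NNG: row i lists the K nearest neighbors of i."""
    return np.argsort(D, axis=1)[:, :K]

def penalized_objective(D, nbrs, lam):
    """Assumed objective: total edge length + lam * sum of squared in-degrees."""
    N, K = nbrs.shape
    rows = np.repeat(np.arange(N), K)
    indeg = np.bincount(nbrs.ravel(), minlength=N)
    return D[rows, nbrs.ravel()].sum() + lam * (indeg ** 2).sum()

def robust_knn_graph(D, K, lam, max_sweeps=20):
    """Greedy sketch: start from the K-NNG, then let each node in turn
    re-pick its K out-neighbors given everyone else's current choices."""
    N = D.shape[0]
    nbrs = knn_graph(D, K)
    indeg = np.bincount(nbrs.ravel(), minlength=N)
    for _ in range(max_sweeps):
        changed = False
        for i in range(N):
            indeg[nbrs[i]] -= 1  # temporarily drop node i's out-edges
            # Adding edge i -> j raises the penalty by (d_j+1)^2 - d_j^2 = 2*d_j + 1.
            cost = D[i] + lam * (2 * indeg + 1)
            new = np.argsort(cost)[:K]
            if set(new.tolist()) != set(nbrs[i].tolist()):
                changed = True
            nbrs[i] = new
            indeg[new] += 1
        if not changed:  # no node wants to move: a local optimum
            break
    return nbrs

rng = np.random.default_rng(0)
D = pairwise_dist(rng.standard_normal((60, 50)))
K, lam = 5, 0.5  # lam is an arbitrary illustrative value
for name, g in [("K-NNG", knn_graph(D, K)), ("robust", robust_knn_graph(D, K, lam))]:
    print(name, "max in-degree:", np.bincount(g.ravel(), minlength=60).max())
```

Each node's update is an exact minimization over its own $K$ out-edges, so the assumed objective never increases from one update to the next. Setting `lam = 0` recovers the ordinary $K$-NNG.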

Abstract

Dimensionality effects pose major challenges in high-dimensional and non-Euclidean data analysis. Graph-based two-sample tests and change-point detection are particularly attractive in this context, as they make minimal distributional assumptions and perform well across a wide range of scenarios. These methods rely on similarity graphs constructed from data, with $K$-nearest neighbor graphs and $K$-minimum spanning trees among the most effective and widely used. However, in high-dimensional and non-Euclidean regimes such graphs often produce hubs -- nodes with disproportionately high degrees -- to which graph-based methods are especially sensitive. To mitigate these dimensionality effects, we propose a robust graph construction that is far less prone to hub formation. Incorporating this construction substantially improves the power of graph-based methods across diverse settings. We further establish a theoretical foundation by proving its consistency under fixed alternatives in both low- and high-dimensional regimes. The effectiveness of the approach is demonstrated through real-world applications, including comparisons of correlation matrices for brain regions, gene expression profiles of T cells, and temporal changes in New York City taxi travel patterns.

Paper Structure

This paper contains 16 sections, 4 theorems, 8 equations, 19 figures, 5 tables, and 1 algorithm.

Key Result

Lemma 4

For a directed graph $G$ with $|G| = O(N^\alpha)$ for some $1 \leq \alpha < 2$, under conditions in the usual limit regime, the GET statistic $S$ satisfies $S \xrightarrow{\mathcal{D}} \chi_2^2$ under the permutation null distribution.
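
Because the limit is $\chi^2_2$, whose survival function is $P(\chi^2_2 > s) = e^{-s/2}$, an asymptotic p-value and rejection threshold follow in closed form. The sketch below only illustrates that arithmetic; it assumes an observed value of $S$ is already available, and the function names are hypothetical.

```python
import math

def chi2_2_pvalue(s):
    """Asymptotic p-value P(chi^2_2 > s) = exp(-s / 2) for an observed statistic s >= 0."""
    if s < 0:
        raise ValueError("the statistic must be non-negative")
    return math.exp(-s / 2.0)

def chi2_2_critical_value(alpha):
    """Reject at level alpha when the statistic exceeds -2 * ln(alpha)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return -2.0 * math.log(alpha)

print(chi2_2_critical_value(0.05))  # about 5.99
```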

Figures (19)

  • Figure 1: Estimated power for different two-sample tests.
  • Figure 2: Estimated power of GET on the $5$-NNG and the $14$-NNG under no perturbation (red), random perturbation (green), and outlier perturbation (orange).
  • Figure 3: Estimated power of GET on the $5$-NNG and the $14$-NNG under no perturbation (red), outlier perturbation (orange), and hub perturbation (blue).
  • Figure 4: Average degree of perturbed observations in the $K$-NNG under the settings used in Figures 2 and 3, with $\sigma^2 = 1.02$.
  • Figure 5: Empirical degree distributions of the $5$-NNG for data drawn from the standard multivariate normal distribution (top panel) and the earlier two-distribution example with $\sigma^2=1.02$ (bottom panel).
  • ...and 14 more figures
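
Degree distributions like those summarized in Figure 5 can be explored with a short sketch that counts in-degrees in a $K$-NNG built on standard multivariate normal data. The sample size, the dimensions, and the function name `knn_in_degrees` are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def knn_in_degrees(X, K):
    """In-degree of every node in the directed K-NNG of the rows of X."""
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared Euclidean distances
    np.fill_diagonal(D2, np.inf)                      # a point is not its own neighbor
    nbrs = np.argpartition(D2, K, axis=1)[:, :K]      # K nearest neighbors, unordered
    return np.bincount(nbrs.ravel(), minlength=X.shape[0])

rng = np.random.default_rng(0)
N, K = 200, 5  # illustrative values
for d in (2, 500):
    indeg = knn_in_degrees(rng.standard_normal((N, d)), K)
    print(f"d={d}: mean in-degree={indeg.mean():.1f}, max in-degree={indeg.max()}")
```

The mean in-degree always equals $K$, so hub formation shows up in the upper tail: in high dimensions the maximum in-degree is typically much larger than in low dimensions.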

Theorems & Definitions (8)

  • Remark 1
  • Remark 2
  • Remark 3
  • Lemma 4
  • Theorem 5
  • Remark 6
  • Theorem 7: Consistency under fixed dimensions
  • Theorem 8: Consistency under high dimensions