Power properties of the two-sample test based on the nearest neighbors graph

Rahul Raphael Kanekar

TL;DR

This work studies nonparametric two-sample testing in multivariate settings by analyzing graph-based tests built on $K$-nearest neighbor graphs in which $K = k_N$ grows with the sample size. It derives CLTs for the Poissonized statistic under both the null and general alternatives, establishes detection thresholds and local power in parametric families, and shows that a 2-sided version of the test eliminates an exponent gap present in the traditional 1-sided test. A dimension-dependent phase transition in the detection threshold persists when $k_N$ grows; in higher dimensions the 2-sided test gains power and its performance moves closer to that of the likelihood-ratio test. The paper also proves consistency via Henze-Penrose dissimilarities and validates the theory through simulations, including higher-dimensional scenarios. Overall, a denser graph (larger $k_N$) increases power, and the 2-sided test performs more robustly across dimensions.

Abstract

In this paper, we study the problem of testing the equality of two multivariate distributions. One class of tests used for this purpose utilizes geometric graphs constructed using inter-point distances. So far, the asymptotic theory of these tests applies only to graphs which fall under the stabilizing graphs framework of Penrose and Yukich (2003). We study the case of the $K$-nearest neighbors graph where $K=k_N$ increases with the sample size, which does not fall under the stabilizing graphs framework. Our main result gives detection thresholds for this test in parametrized families when $k_N = o(N^{1/4})$, thus extending the family of graphs where the theoretical behavior is known. We propose a 2-sided version of the test which removes an exponent gap that plagues the 1-sided test. Our result also shows that increasing the number of nearest neighbors boosts the power of the test. This provides theoretical justification for using denser graphs in testing equality of two distributions.
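
To make the test described in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes the statistic is the number of directed edges in the $k$-nearest neighbors graph of the pooled sample that run from sample 1 to sample 2, and it calibrates that count with a permutation null rather than the asymptotic normal approximation the paper derives. The function names (`knn_indices`, `cross_edge_count`, `permutation_pvalues`) and the particular form of the 2-sided p-value are illustrative assumptions.

```python
import numpy as np


def knn_indices(points, k):
    """Indices of the k nearest neighbors of each row (brute force, Euclidean)."""
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def cross_edge_count(neighbors, labels):
    """Count directed edges i -> j with labels[i] == 0 and labels[j] == 1."""
    from_first = labels == 0
    return int(np.sum(labels[neighbors[from_first]] == 1))


def permutation_pvalues(x, y, k, n_perm=200, seed=0):
    """Observed cross-edge count with 1-sided and 2-sided permutation p-values.

    Assumption: permutation calibration stands in for the paper's CLT.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    labels = np.concatenate(
        [np.zeros(len(x), dtype=int), np.ones(len(y), dtype=int)]
    )
    # The graph depends only on the pooled points, so build it once.
    neighbors = knn_indices(pooled, k)
    observed = cross_edge_count(neighbors, labels)
    null = np.array(
        [cross_edge_count(neighbors, rng.permutation(labels)) for _ in range(n_perm)]
    )
    # 1-sided: few cross-sample edges suggest the distributions differ.
    p_one = (1 + np.sum(null <= observed)) / (n_perm + 1)
    # 2-sided: any large deviation from the permutation mean counts as evidence.
    center = null.mean()
    p_two = (1 + np.sum(np.abs(null - center) >= abs(observed - center))) / (n_perm + 1)
    return observed, p_one, p_two


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.0, size=(100, 2))
    y = rng.normal(loc=0.5, size=(100, 2))
    for k in (1, 3, 10):
        print(k, permutation_pvalues(x, y, k))
```

The brute-force distance matrix costs $O(N^2)$ memory, which is fine for a sketch but not for sample sizes such as the $N=20000$ used in the paper's simulations; a tree-based neighbor search would be the natural replacement there.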

Paper Structure

This paper contains 33 sections, 26 theorems, 319 equations, 6 figures, 1 table.

Key Result

Proposition 3.1

Let $f,g$ be two densities on $\mathbb{R}^d.$ Let $\{k_N\}_{N\geq 1}$ be a sequence of natural numbers such that $k_N = o(N)$. Then,

Figures (6)

  • Figure 1: On the left is the undirected MST formed from 10 samples of $N(0,I_2)$ (colored red) and 10 samples of $N(0.2,I_2)$ (colored green). On the right is the MST formed from 10 samples each of $N(0,I_2)$ (red) and $N(2,I_2)$ (green). The edges going across samples are colored black. Edges within samples are colored gold.
  • Figure 2: On the left is the directed $3$-NN graph formed from 10 samples of $N(0,I_2)$ (colored red) and 10 samples of $N(0.2,I_2)$ (colored blue). On the right is the $3$-NN graph formed from 10 samples each of $N(0,I_2)$ (red) and $N(2,I_2)$ (blue). The edges going from sample 1 to sample 2 are colored black. Edges within samples are colored gold.
  • Figure 3: The limiting power for the 1-sided test in the spherical normal family with $d=6$. The left hand panel shows the power of the test with $k_N = 2,5,8,10$, which corresponds to taking $k_N = N^\delta$ for $\delta<1/4$. The right hand panel has $k_N = 20,50,100,200$, which corresponds to $k_N = N^\delta$ for $1/4<\delta<1$ with sample size $N=20000$.
  • Figure 4: The limiting power for the 1-sided test in the spherical normal family with $d=6$. The left hand panel shows the power of the test with $k_N = 2,5,8,10$, which corresponds to taking $k_N = N^\delta$ for $\delta<1/4$. The right hand panel has $k_N = 20,50,100,200$, which corresponds to $k_N = N^\delta$ for $1/4<\delta<1$ with sample size $N=20000$.
  • Figure 5: Heatmaps of limiting power for the 1-sided and 2-sided tests in the spherical normal family with $d=25$ for $h>0$. The X-axis shows the exponent $b$ and the Y-axis shows the number of neighbors $k_N$. Shades of red denote low power and shades of green denote high power.
  • ...and 1 more figure

Theorems & Definitions (43)

  • Definition 2.1
  • Example 2.1
  • Example 2.2
  • Example 2.3
  • Definition 3.1
  • Proposition 3.1
  • Theorem 4.1
  • Theorem 4.2
  • Lemma 5.1
  • Theorem 5.1
  • ...and 33 more