The Chi-Square Test of Distance Correlation

Cencheng Shen, Sambit Panda, Joshua T. Vogelstein

TL;DR

The paper proves that the chi-square test can be valid and universally consistent for testing independence, and establishes a testing power inequality with respect to the permutation test.

Abstract

Distance correlation has gained much recent attention in the data science community: the sample statistic is straightforward to compute and asymptotically equals zero if and only if the variables are independent, making it an ideal choice for discovering any type of dependency structure given sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depends on the underlying random variables and the choice of metric, it typically requires a permutation test to estimate the null and compute the p-value, which is very costly for large amounts of data. To overcome this difficulty, in this paper we propose a chi-square test for distance correlation. Method-wise, the chi-square test is non-parametric, extremely fast, and applicable to bias-corrected distance correlation using any strong negative type metric or characteristic kernel. The test exhibits testing power similar to that of the standard permutation test, and can be utilized for K-sample and partial testing. Theory-wise, we show that the underlying chi-square distribution well approximates and dominates the limiting null distribution in the upper tail, prove that the chi-square test can be valid and universally consistent for testing independence, and establish a testing power inequality with respect to the permutation test.
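To make the cost of the permutation baseline concrete, the following is a minimal sketch of bias-corrected distance correlation with a permutation p-value. It assumes the Euclidean metric and the standard U-centering form of the bias correction; the function names (`u_center`, `dcor_unbiased`, `permutation_pvalue`) and the defaults `reps=200`, `seed=0` are illustrative choices, not taken from the paper.

```python
import math
import random


def distance_matrix(points):
    """Pairwise Euclidean distances; each point must be a tuple of floats."""
    return [[math.dist(p, q) for q in points] for p in points]


def u_center(d):
    """U-center a symmetric distance matrix (bias correction; needs n > 3)."""
    n = len(d)
    row = [sum(r) for r in d]
    total = sum(row)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # the diagonal stays zero by definition
                c[i][j] = (d[i][j] - row[i] / (n - 2) - row[j] / (n - 2)
                           + total / ((n - 1) * (n - 2)))
    return c


def inner(a, b):
    """Unbiased sample distance covariance from two U-centered matrices."""
    n = len(a)
    s = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))
    return s / (n * (n - 3))


def dcor_unbiased(a, b):
    """Bias-corrected distance correlation; it can be slightly negative."""
    denom = math.sqrt(inner(a, a) * inner(b, b))
    return inner(a, b) / denom if denom > 0 else 0.0


def permutation_pvalue(x, y, reps=200, seed=0):
    """Return (statistic, p-value), estimating the null by permuting y."""
    rng = random.Random(seed)
    a = u_center(distance_matrix(x))
    b = u_center(distance_matrix(y))
    observed = dcor_unbiased(a, b)
    n = len(x)
    hits = 0
    for _ in range(reps):
        perm = list(range(n))
        rng.shuffle(perm)
        # Permuting rows and columns of the centered matrix is equivalent
        # to re-centering the distance matrix of the permuted sample.
        bp = [[b[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if dcor_unbiased(a, bp) >= observed:
            hits += 1
    return observed, (1 + hits) / (1 + reps)
```

Each replicate costs a further pass over all n^2 pairs, so the total work grows with `reps` times n^2; this is the overhead that a closed-form null approximation avoids.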

Paper Structure

This paper contains 31 sections, 17 theorems, 48 equations, 4 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

The distance correlation chi-square test that rejects independence if and only if $n \cdot \mathrm{Dcor}_n(\mathbf{X}, \mathbf{Y}) \geq F^{-1}_{\chi^2_1 - 1}(1-\alpha)$ is a valid and universally consistent test for sufficiently large $n$ and sufficiently small type 1 error level $\alpha$.
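As a hedged illustration of how such a test could be applied, the sketch below turns a precomputed bias-corrected distance correlation into a p-value. It assumes that the rejection rule compares $n$ times the statistic against the upper tail of the centered chi-square distribution $\chi^2_1 - 1$ (the distribution named in Figure 1), so the p-value is $P(\chi^2_1 \geq n \cdot \mathrm{Dcor}_n + 1)$. The function names and the default `alpha=0.05` are illustrative, not the paper's.

```python
import math


def chi_square_pvalue(stat, n):
    """P-value of a bias-corrected distance correlation `stat` from n samples.

    Assumption: the null of n * stat is approximated by chi2(1) - 1, so
    p = P(chi2_1 >= n * stat + 1).
    """
    shifted = n * stat + 1.0
    if shifted <= 0.0:
        # A chi-square variable is non-negative, so the tail probability is 1.
        return 1.0
    # For one degree of freedom, P(chi2_1 >= x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(shifted / 2.0))


def rejects_independence(stat, n, alpha=0.05):
    """Reject independence when the approximate p-value is at most alpha."""
    return chi_square_pvalue(stat, n) <= alpha
```

Under this assumption the p-value is a single closed-form evaluation, with no resampling. The theorem's restriction to sufficiently small $\alpha$ applies to any use of this rule.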

Figures (4)

  • Figure 1: The top row compares the centered chi-square distribution, the normal distribution, and the actual null distribution of distance correlation in case of varying dimensions. The bottom row shows the weights used in the limiting null distribution in each case.
  • Figure 2: Evaluation of distance correlation using different tests for linear, quadratic, spiral, and independent simulations. The top row shows the power using the Euclidean distance, the center row shows the power using the Gaussian kernel, and the bottom row shows the running time (in log scale) for each method in the top row.
  • Figure 3: Evaluation of distance correlation using different tests for four increasing-dimensional simulations using Euclidean distance. The first row shows the testing power in each simulation, and the second row shows the running time (in log scale) for each method in the respective first row.
  • Figure 4: The testing power for the simulation in Figure 1. The left panel fixes $n=200$ and lets $p,q$ increase; the right panel sets $p=q=50$ and lets $n$ increase.

Theorems & Definitions (30)

  • Theorem 1
  • Theorem 2
  • Corollary 3
  • Corollary 4
  • Theorem 1
  • Definition 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Corollary 6
  • ...and 20 more