A new measure of dependence: Integrated $R^2$

Mona Azadkia, Pouya Roudaki

Abstract

We introduce a novel measure of dependence that captures the extent to which a random variable $Y$ is determined by a random vector $X$. The measure equals zero precisely when $Y$ and $X$ are independent, and it attains one exactly when $Y$ is almost surely a measurable function of $X$. We further extend this framework to define a measure of conditional dependence between $Y$ and $X$ given $Z$. We propose a simple and interpretable estimator with computational complexity comparable to classical correlation coefficients, including those of Pearson, Spearman, and Chatterjee. Leveraging this dependence measure, we develop a tuning-free, model-agnostic variable selection procedure and establish its consistency under appropriate sparsity conditions. Extensive experiments on synthetic and real datasets highlight the strong empirical performance of our methodology and demonstrate substantial gains over existing approaches.

Paper Structure

This paper contains 29 sections, 21 theorems, 182 equations, 9 figures, 4 tables.

Key Result

Theorem 2.1

For random variables $Y$ and $\mathbf{X}$ such that $Y$ is not almost surely a constant, $\nu(Y, \mathbf{X})$ belongs to the interval $[0, 1]$, it is $0$ if and only if $Y$ and $\mathbf{X}$ are independent, and it is $1$ if and only if there exists a measurable function $f:\mathbb{R}^p\rightarrow\ma

Figures (9)

  • Figure 1: Values of $\nu_n(Y, X)$ for various kinds of scatterplots with $n = 100$. Noise increases from left to right.
  • Figure 2: Histogram of $10000$ simulations of $\nu_{n}^{\text{1-dim}}(Y, X)$ with $X$ and $Y$ independently distributed as Uniform$[0, 1]$, overlaid with the asymptotic normal density $N(\mu_n, \sigma_n^2)$, where $\mu_n = 2/n$ and $\sigma_n^2 = (\pi^2/3 - 3)/n$.
  • Figure 3: Histogram of $10{,}000$ simulations of $\nu_{n}^{\text{1-dim}}(Y, X)$ under the dependence structure between $X$ and $Y$ described in Example \ref{exAsymptotic}, overlaid with the normal density curve whose estimated mean and standard deviation are $0.314$ and $0.02$, respectively.
  • Figure 4: Comparison of the power of several tests of independence described in Example \ref{exPowerAnalysis}. The level of the noise or homoskedasticity increases from left to right. In each case, the sample size is 100, and 500 simulations were used to estimate the power. The p-values were calculated using 1000 independent permutations.
  • Figure 5: Comparison of the empirical power of several tests of independence described in Example \ref{exPowerAnalysis}. The noise level (or degree of homoskedasticity) increases from left to right. The sample size is $n=100$, and power is estimated based on 500 Monte Carlo simulations. P-values are computed using 1,000 independent permutations.
  • ...and 4 more figures
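
The captions of Figures 4 and 5 describe a permutation-based power study: p-values from 1000 permutations, power estimated over 500 simulations at $n = 100$. The sketch below illustrates that general recipe only. The estimator $\nu_n$ is not defined in this summary, so `abs_pearson` is a stand-in statistic; the linear alternative and the names `permutation_pvalue` and `estimate_power` are likewise hypothetical and are not taken from the paper.

```python
import numpy as np


def abs_pearson(x, y):
    """Stand-in statistic (NOT the paper's nu_n): absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])


def permutation_pvalue(x, y, statistic, n_perm=1000, seed=0):
    """One-sided permutation p-value for H0: x and y are independent."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    # Shuffling y breaks any dependence on x while preserving both marginals.
    null_stats = np.array(
        [statistic(x, rng.permutation(y)) for _ in range(n_perm)]
    )
    # The add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(null_stats >= observed)) / (1 + n_perm)


def estimate_power(statistic, n=100, n_sim=500, n_perm=1000,
                   level=0.05, noise=1.0, seed=0):
    """Fraction of simulated datasets on which the test rejects at `level`."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        x = rng.uniform(-1.0, 1.0, size=n)
        # Hypothetical alternative: linear signal plus Gaussian noise.
        y = x + noise * rng.normal(size=n)
        p = permutation_pvalue(x, y, statistic, n_perm=n_perm,
                               seed=int(rng.integers(2**32)))
        rejections += int(p <= level)
    return rejections / n_sim
```

Calling `estimate_power(abs_pearson, noise=0.5)` with the defaults mirrors the simulation sizes in the captions, but it evaluates the statistic about 500,000 times; smaller `n_sim` and `n_perm` are enough for a quick check. Any other dependence statistic with the same `(x, y)` signature can be passed in place of `abs_pearson`.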

Theorems & Definitions (53)

  • Theorem 2.1
  • Remark 2.2
  • Theorem 2.3
  • Theorem 3.1
  • Corollary 3.2
  • Remark 3.3
  • Theorem 3.4
  • Proposition 3.5
  • Proposition 3.6
  • Theorem 3.7
  • ...and 43 more