Quantifying and testing dependence to categorical variables

Siegfried Hörmann; Daniel Strenger-Galvis

TL;DR

The paper introduces a permutation-invariant dependence coefficient $\psi(X,Y)$ for a $K$-level categorical response $Y$ and a covariate $X$ taking values in a separable metric space. Independence of $X$ and $Y$ is linked to the $K$-sample problem, and functional dependence corresponds to the extremal value $\psi=1$. The authors build a practical estimator that uses nearest-neighbor proxies for an unobserved coupling variable and prove its consistency under mild conditions. They also give a fast, resampling-free independence test whose statistic converges to a $\chi^2$ limit with $(K-1)^2$ degrees of freedom under independence and which is consistent when dependence is present. The framework extends to conditional dependence via $\psi(Z,Y|X)$ and to broader norm-based variants that preserve the key theoretical properties, including permutation invariance. Empirical evaluation on simulations and real data (election data, the spambase and bankruptcy data sets, and MNIST) demonstrates strong performance and scalability. An accompanying R package implements the estimator, the test, and a variable-selection procedure.

Abstract

We suggest a dependence coefficient between a categorical variable and some general variable taking values in a metric space. We derive important theoretical properties and study the large sample behaviour of our suggested estimator. Moreover, we develop an independence test which has an asymptotic $\chi^2$-distribution if the variables are independent and prove that this test is consistent against any violation of independence. The test is also applicable to the classical $K$-sample problem with possibly high- or infinite-dimensional distributions. We discuss some extensions, including a variant of the coefficient for measuring conditional dependence.
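
The nearest-neighbor idea mentioned in the TL;DR can be illustrated with a small toy computation. The sketch below is not the paper's coefficient $\psi$ or its estimator, neither of which is defined on this page. It is a hypothetical stand-in (the function name `nn_label_agreement` and the chance correction are assumptions made for illustration) that only shows the two extremes the summary describes: a score near 1 when the label is a function of the covariate, and a score near 0 when the two are independent. It assumes Euclidean distance and at least two observed classes.

```python
import numpy as np

def nn_label_agreement(X, y):
    """Chance-corrected rate at which a point's nearest neighbour shares its label.

    Illustrative stand-in only; this is NOT the coefficient psi from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if X.ndim == 1:
        X = X[:, None]
    # Pairwise squared Euclidean distances (O(n^2) memory; fine for a small demo).
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    nn = d2.argmin(axis=1)
    agree = np.mean(y == y[nn])
    # Agreement expected by chance if labels were independent of X.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    chance = np.sum(p ** 2)  # requires at least two classes, else chance == 1
    return (agree - chance) / (1.0 - chance)

rng = np.random.default_rng(0)
n, K = 600, 3
x = rng.uniform(0.0, 3.0, size=n)
y_dep = np.floor(x).astype(int)      # label is a function of x
y_ind = rng.integers(0, K, size=n)   # label drawn independently of x

score_dep = nn_label_agreement(x, y_dep)
score_ind = nn_label_agreement(x, y_ind)
print(score_dep, score_ind)
```

The brute-force distance matrix keeps the example short; for larger samples a tree-based neighbor search would be the natural replacement.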
