Quantifying and testing dependence to categorical variables
Siegfried Hörmann, Daniel Strenger-Galvis
TL;DR
The paper introduces a permutation-invariant dependence coefficient $\psi(X,Y)$ for a $K$-level categorical response $Y$ and a covariate $X$ in a separable metric space, connecting independence to a $K$-sample framework and functional dependence to the extremal value $\psi=1$. It builds a practical estimator using nearest-neighbor proxies for an unobserved coupling variable and proves consistency under mild conditions, along with a fast, resampling-free independence test that converges to a $\chi^2((K-1)^2)$ limit under independence and is consistent when dependence is present. The framework extends to conditional dependence via $\psi(Z,Y|X)$ and to broader norm-based variants that preserve the key theoretical properties, including permutation invariance. Empirical evaluation on simulations and real data (including election data, spambase/bankruptcy, and MNIST) demonstrates strong performance and scalability, with an accompanying R package implementing the estimator, test, and variable-selection procedure.
Abstract
We suggest a dependence coefficient between a categorical variable and some general variable taking values in a metric space. We derive important theoretical properties and study the large sample behaviour of our suggested estimator. Moreover, we develop an independence test which has an asymptotic $\chi^2$-distribution if the variables are independent and prove that this test is consistent against any violation of independence. The test is also applicable to the classical $K$-sample problem with possibly high- or infinite-dimensional distributions. We discuss some extensions, including a variant of the coefficient for measuring conditional dependence.
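The link to the $K$-sample problem can be made explicit with a standard fact. The notation here is an assumption for illustration and is not taken from the paper: $Y$ takes values in $\{1,\dots,K\}$ with $P(Y=k)>0$ for every $k$, and $P_k$ denotes the conditional distribution of $X$ given $Y=k$. Then

$$X \text{ and } Y \text{ are independent} \iff P_1 = P_2 = \dots = P_K .$$

Grouping the observations of $X$ by the level of $Y$ therefore turns a test of independence into a test of whether $K$ samples share a common distribution, and conversely.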
