Quantifying and testing dependence to categorical variables

Siegfried Hörmann, Daniel Strenger-Galvis

TL;DR

The paper introduces a permutation-invariant dependence coefficient $\psi(X,Y)$ for a K-level categorical response $Y$ and a covariate $X$ in a separable metric space, connecting independence to a K-sample framework and functional dependence to the extremal value $\psi=1$. It builds a practical estimator using nearest-neighbor proxies for an unobserved coupling variable and proves consistency under mild conditions, along with a fast, resampling-free independence test that converges to a $\chi^2((K-1)^2)$ limit under independence and is consistent when dependence is present. The framework extends to conditional dependence via $\psi(Z,Y|X)$ and to broader norm-based variants that preserve the key theoretical properties, including permutation invariance. Empirical evaluation on simulations and real data (including election data, spambase/bankruptcy, and MNIST) demonstrates strong performance and scalability, with an accompanying R package implementing the estimator, test, and variable-selection procedure.

Abstract

We suggest a dependence coefficient between a categorical variable and some general variable taking values in a metric space. We derive important theoretical properties and study the large sample behaviour of our suggested estimator. Moreover, we develop an independence test which has an asymptotic $\chi^2$-distribution if the variables are independent and prove that this test is consistent against any violation of independence. The test is also applicable to the classical $K$-sample problem with possibly high- or infinite-dimensional distributions. We discuss some extensions, including a variant of the coefficient for measuring conditional dependence.
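As a rough illustration of how a test with a $\chi^2$ limit is typically used, the sketch below turns an already computed test statistic into an asymptotic p-value, using the $(K-1)^2$ degrees of freedom stated in the TL;DR. The statistic itself is not specified in this summary, so it is treated as a given number; the function names `chi2_sf` and `independence_pvalue` are hypothetical and do not come from the paper or its R package.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(chi2_df > x) for an integer df >= 1.

    Uses the exact recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with z = x / 2.
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 1:
        k, q = 1, math.erfc(math.sqrt(z))  # closed form for df = 1
    else:
        k, q = 2, math.exp(-z)             # closed form for df = 2
    while k < df:
        a = k / 2.0
        # Add z**a * exp(-z) / Gamma(a + 1), computed on the log scale for stability.
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        k += 2
    return q


def independence_pvalue(statistic, num_levels):
    """Asymptotic p-value assuming a chi2((K - 1)**2) limit under independence.

    `statistic` is assumed to be supplied by the caller (hypothetical input);
    `num_levels` is K, the number of levels of the categorical variable.
    """
    df = (num_levels - 1) ** 2
    return chi2_sf(statistic, df)


if __name__ == "__main__":
    # Hypothetical statistic value for K = 3 levels (4 degrees of freedom).
    print(independence_pvalue(9.49, 3))
```

Because the reference distribution is known in closed form, no permutations or bootstrap resamples are needed, which is what makes a test of this kind resampling-free.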

Paper Structure

This paper contains 25 sections, 25 theorems, 91 equations, 3 figures, 3 tables.

Key Result

Proposition 1

For some categorical variable $Y$ let $\mathcal{L}_Y$ be the class of bijections between the $K$ levels of $Y$ and the integers $\{1,\ldots, K\}$. For any $\varepsilon>0$ there exists a categorical variable $Y$ and a random variable $X$ such that

Figures (3)

  • Figure 1: Power functions of the different independence tests. The case $\lambda=0$ corresponds to the null-hypothesis of independence between $Y$ and $X$, while ${\lambda=1}$ corresponds to a functional relation $Y=f(X)$.
  • Figure 2: Illustration of the pixels selected by psicor feature selection. The brighter the pixels the earlier they have been selected.
  • Figure 3: The first eight images from the MNIST training data (top) and their selected pixels (bottom).

Theorems & Definitions (55)

  • Proposition 1
  • Lemma 1
  • Lemma 2
  • Remark 1
  • Remark 2
  • Remark 3
  • Theorem 1
  • Remark 4
  • Remark 5
  • Remark 6
  • ...and 45 more