A direct extension of Azadkia & Chatterjee's rank correlation to multi-response vectors

Jonathan Ansari, Sebastian Fuchs

TL;DR

This work directly generalizes Chatterjee's rank correlation $ξ$ to multivariate responses by introducing a scale-invariant predictability measure $T$, built by converting a multivariate regression problem into a univariate conditional-dependence framework. $T$ satisfies the core axioms of a measure of predictability, the information gain inequality, and a conditional-independence characterization, while admitting a fast, nonparametric, rank-based estimator $T_n$ with asymptotic normality. A permutation-invariant variant $\overline{T}$ extends applicability to unordered response components, and closed-form results for the multivariate normal (MVN) distribution provide intuition on dependence structure. The authors leverage these properties to develop MFOCI, a model-free, tuning-parameter-free multivariate feature ordering and selection method that scales to multi-output data and demonstrates strong performance against existing approaches in simulations and real data. Overall, the framework enables efficient, interpretable quantification of dependence and robust variable selection for multi-outcome problems across domains.

Abstract

Recently, Chatterjee (2023) recognized the lack of a direct generalization of his rank correlation $ξ$ in Azadkia and Chatterjee (2021) to a multi-dimensional response vector. As a natural solution to this problem, we here propose an extension of $ξ$ that is applicable to a set of $q \geq 1$ response variables, where our approach builds upon converting the original vector-valued problem into a univariate problem and then applying the rank correlation $ξ$ to it. Our novel measure $T$ quantifies the scale-invariant extent of functional dependence of a response vector $\mathbf{Y} = (Y_1,\dots,Y_q)$ on predictor variables $\mathbf{X} = (X_1, \dots,X_p)$, characterizes independence of $\mathbf{X}$ and $\mathbf{Y}$ as well as perfect dependence of $\mathbf{Y}$ on $\mathbf{X}$ and hence fulfills all the characteristics of a measure of predictability. Aiming at maximum interpretability, we provide various invariance results for $T$ as well as a closed-form expression in multivariate normal models. Building upon the graph-based estimator for $ξ$ in Azadkia and Chatterjee (2021), we obtain a non-parametric, strongly consistent estimator for $T$ and show -- as a main contribution -- its asymptotic normality. Based on this estimator, we develop a model-free and rank-based feature ranking and forward feature selection for multiple-outcome data that works without any tuning parameters. Simulation results and real case studies illustrate $T$'s broad applicability.
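
To make the univariate building block concrete, the following is a minimal sketch of the nearest-neighbour (graph-based) estimator of Azadkia and Chatterjee (2021) for a single response and no conditioning variables. It is not the paper's estimator $T_n$ for a response vector and not the implementation in the R package didec. The function name `ac_dependence` is hypothetical, and ties among nearest neighbours are resolved by taking the first index, which is a simplification made for this sketch.

```python
import numpy as np

def ac_dependence(y, x):
    """Sketch of the Azadkia-Chatterjee estimator for a scalar response y
    and predictors x (shape (n,) or (n, p)), without conditioning variables.
    Hypothetical helper for illustration only."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = y.shape[0]

    # R_i = #{j : y_j <= y_i} and L_i = #{j : y_j >= y_i}
    y_sorted = np.sort(y)
    R = np.searchsorted(y_sorted, y, side="right")
    L = n - np.searchsorted(y_sorted, y, side="left")

    # Index N(i) of the nearest neighbour of x_i among the other points
    # (brute force, O(n^2) memory; ties resolved by first index).
    d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)

    # Statistic: sum(n * min(R_i, R_N(i)) - L_i^2) / sum(L_i * (n - L_i))
    num = np.sum(n * np.minimum(R, R[nn]) - L ** 2)
    den = np.sum(L * (n - L))
    return num / den
```

Under these assumptions, calling `ac_dependence(x**2, x)` on a sample where the response is a noiseless function of the predictor should give a value close to 1, while an independent response should give a value close to 0.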

Paper Structure (31 sections, 28 theorems, 137 equations, 6 figures, 9 tables)

Key Result

Theorem 2.1

The map $T$ defined by \ref{defmdm} is a measure of predictability.

Figures (6)

  • Figure 1: Boxplots comparing the $500$ obtained normalized information gains in \ref{infgain} for $\rho^2$ estimated via the R function KMAc (R package KPC) with those for $T$ estimated via the R function didec (R package didec). The sample size is $1,000$, and $\alpha$ varies over $\alpha \in \{0.1, 0.3, 0.7, 1, 3, 7, 10, 30, 100, 300\}$ from left to right.
  • Figure 2: Boxplots summarizing the $1,000$ obtained estimates for $T_n$. Samples of size $n$ are drawn from a multivariate normal distribution with $5$ predictor and $2$ response variables (left panel) and with $2$ predictor and $4$ response variables (right panel).
  • Figure 3: Boxplots for varying $\sigma \in \{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1\}$ from left to right comparing the $1,000$ obtained dependence values of the static convex combination $\kappa^{\boldsymbol{\alpha}}((Y_1,Y_2),X)$ in \ref{KappaAverage2} with fixed weights $(\alpha_1,\alpha_2)=(0.5,0.5)$ estimated via the R function codec (R package FOCI) with those of $T((Y_1,Y_2),X)$ in \ref{defmdm} estimated via the R function didec (R package didec). Since $(Y_1,Y_2)$ is independent of $X$, the true dependence value equals $0$ (depicted by the red dashed line).
  • Figure 4: Interconnectedness of the three largest banks in the US, Europe, and Asia and Pacific, as well as connectedness with the banks Citigroup and Deutsche Bank, measured by $T$; see Subsection \ref{secnetworks} for details.
  • Figure 5: Projection of the random variable $\mathds{1}_F\circ Y$ onto the plane spanned by $\mathbf{X}=(X_1,X_2)$ as well as projections of $\mathds{1}_F\circ Y$ and $\mathbb{E}[\mathds{1}_F\circ Y|\mathbf{X}]$ onto the line spanned by $X_1$; see Section \ref{Geom.Sub.1} for interpretations regarding $T(Y,\mathbf{X})$. Lengths of vectors are measured w.r.t. the $\lVert \cdot \rVert_{L^2}$-norm of the Hilbert space $L^2(\Omega)$.
  • ...and 1 more figure

Theorems & Definitions (57)

  • Theorem 2.1: $T$ as a measure of predictability
  • Corollary 2.2: Data processing inequality
  • Corollary 2.3: Self-equitability
  • Proposition 2.4: Distribution invariance
  • Proposition 2.5: Dimension reduction principle
  • Corollary 2.6: $\overline{T}$ as a measure of predictability
  • Proposition 2.7: Closed-form expression for the multivariate normal distribution
  • Proposition 2.8: Characterization of extreme cases
  • Theorem 3.1: Consistency
  • Remark 3.2
  • ...and 47 more