Azadkia-Chatterjee's dependence coefficient for infinite dimensional data

Siegfried Hörmann, Daniel Strenger

TL;DR

This work extends Azadkia-Chatterjee's dependence coefficient to covariates $X$ in general metric spaces, including infinite-dimensional functional data, and analyzes the associated nearest-neighbor (NN) based estimator. It reveals that the nearest-neighbor degree can diverge polynomially in functional spaces, undermining standard asymptotics, and then establishes a data-dependent, self-normalized CLT for an independence test that remains universally consistent under mild conditions. The paper provides verifiable conditions for Gaussian functional data and demonstrates the method on Austrian municipalities' age-structure curves with COVID-19 vaccination data, showing strong dependence and favorable computational performance. These results offer guidance for applying graph-based dependence measures in infinite-dimensional settings and highlight the need to account for growing NN degrees in practice.

Abstract

We extend the scope of Azadkia-Chatterjee's dependence coefficient between a scalar response $Y$ and a multivariate covariate $X$ to the case where $X$ takes values in a general metric space. Particular attention is paid to the case where $X$ is a curve. Although extending this framework at the population level is relatively straightforward, analyzing the asymptotic behavior of the estimator proves to be complex. This complexity is largely related to the nearest neighbor structure of the infinite-dimensional covariate sample, leading us to explore a topic that has not been previously addressed in the literature. The primary contribution of this paper is to provide insights into this issue and propose strategies to address it. Our findings also have significant implications for other graph-based methods facing similar challenges.
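
To make the nearest-neighbor construction concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the unconditional Azadkia-Chatterjee form $\sum_i (n \min\{R_i, R_{N(i)}\} - L_i^2) / \sum_i L_i (n - L_i)$, with the nearest neighbor $N(i)$ taken with respect to an arbitrary distance matrix instead of Euclidean distance. The function names, the discretized $L^2$ distance on a common grid, and the first-index tie-breaking are assumptions made for illustration; the paper's exact definitions are not reproduced in this summary.

```python
import numpy as np


def l2_distance_matrix(curves):
    """Pairwise discretized L2 distances for curves sampled on a common grid.

    curves: array of shape (n, m), one curve per row (illustrative assumption).
    """
    diff = curves[:, None, :] - curves[None, :, :]
    return np.sqrt((diff ** 2).mean(axis=2))


def nearest_neighbor_indices(dist):
    """Index N(i) of the nearest neighbor of each point, excluding the point itself.

    Ties are broken by taking the first minimizer (a simplification).
    """
    d = np.array(dist, dtype=float)  # copy, so the caller's matrix is untouched
    np.fill_diagonal(d, np.inf)
    return d.argmin(axis=1)


def ac_coefficient(dist, y):
    """Nearest-neighbor dependence coefficient of y on a metric-space covariate."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    nn = nearest_neighbor_indices(dist)
    r = (y[None, :] <= y[:, None]).sum(axis=1)  # R_i = #{j : y_j <= y_i}
    l = (y[None, :] >= y[:, None]).sum(axis=1)  # L_i = #{j : y_j >= y_i}
    numerator = np.sum(n * np.minimum(r, r[nn]) - l ** 2)
    denominator = np.sum(l * (n - l))
    return numerator / denominator


def nn_in_degrees(dist):
    """Number of sample points that have point i as their nearest neighbor."""
    nn = nearest_neighbor_indices(dist)
    return np.bincount(nn, minlength=len(nn))
```

Passing a precomputed distance matrix keeps the sketch agnostic about the metric space. The in-degree helper corresponds to the quantity illustrated in Figure 1, where a single curve is the nearest neighbor of many others.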

Paper Structure (13 sections, 23 theorems, 102 equations, 5 figures, 2 tables)

Key Result

Theorem 1

Assume that $X$ takes values in a separable metric space $(H,d)$. Let Assumption ass:cont hold. Then we have that

Figures (5)

  • Figure 1: Age distribution curves for 2117 municipalities in Austria. On the $y$-axis we see the proportions. The curve related to Wolfsberg (solid red) is the nearest neighbor of the age distribution curves of 66 municipalities (blue).
  • Figure 2: Age curves are colored according to the vaccination rates of the corresponding municipalities.
  • Figure 3: Visualization of the sqnorm (top) and sin (bottom) relationships. The $n=1000$ curves $X$ are colored according to $Y$. The bar plots compare the values of $\widehat{T}_n$ and $\widehat{R}_n$.
  • Figure 4: Histogram of $\mathcal{I}_n$ under independence and comparison to the standard normal density.
  • Figure 5: Estimated powers of the three tests for independence at different levels of noise.

Theorems & Definitions (47)

  • Remark 1
  • Remark 2
  • Theorem 1
  • Theorem 2
  • Remark 3
  • Theorem 3
  • Remark 4
  • Corollary 1
  • Theorem 4
  • Lemma 1
  • ...and 37 more