Tests for categorical data beyond Pearson: A distance covariance and energy distance approach

Fernando Castro-Prado; Wenceslao González-Manteiga; Javier Costas; Fernando Facal; Dominic Edelmann

Abstract

Categorical variables are of utmost importance in biomedical research. When two of them are considered, one often wants to test whether or not they are statistically dependent. We show weaknesses of classical methods, such as Pearson's chi-squared test and the $G$-test, and we propose distance-based testing strategies that lack those drawbacks. We first develop this theory for classical two-dimensional contingency tables, within the context of distance covariance, an association measure that characterises general statistical independence of two variables. We then apply the same fundamental ideas to one-dimensional tables, namely to testing for goodness of fit to a discrete distribution, for which we resort to an analogous statistic called energy distance. We prove that our methodology has desirable theoretical properties, and we show how to calibrate the null distribution of our test statistics without resorting to any resampling technique. We illustrate all this in simulations, as well as with some real data examples, demonstrating the adequate performance of our approach for biostatistical practice.

Paper Structure (8 sections, 2 theorems, 74 equations, 2 figures, 1 table)

Introduction
The distance covariance test of independence between two categorical variables
The energy test for goodness of fit to a discrete distribution
Simulation study
Real data analyses
Discussion and Conclusion
Proof of Theorem 1
Proof of Theorem 2

Key Result

Theorem 1

Let $(X_1,\ldots,X_n)$ and $(Y_1,\ldots,Y_n)$ be IID samples of jointly distributed random variables $(X,Y) \in \{1,2,\ldots,I\} \times\{1,2,\ldots,J\}$, with $q_i := P(X=i)$ and $r_j := P(Y=j)$. Consider the supports $\mathcal{X} = \{1,2,\ldots,I\}$ and $\mathcal{Y} = \{1,2,\ldots,J\}$ equipped with the discrete metric. Then the empirical distance [...]. In addition, whenever $X$ and $Y$ are independent, [...] for $n \to \infty$, where $Z_{ij}^2$ are indepe[...]
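To illustrate the quantity this theorem is about, here is a minimal Python sketch of the empirical distance covariance of two categorical samples under the discrete metric. It assumes the usual V-statistic definition (the mean of the entrywise product of the two double-centred pairwise distance matrices), which is not spelled out in the statement above. All function names are hypothetical. The second function computes the sum of squared differences between observed joint proportions and products of marginal proportions. This is a standard algebraic identity for the discrete metric and is given here only as a cross-check, not necessarily as the form stated in the theorem.

```python
from itertools import product


def discrete_metric(a, b):
    """Discrete metric: 0 if the two categories coincide, 1 otherwise."""
    return 0.0 if a == b else 1.0


def double_centre(d):
    """Subtract row and column means and add back the grand mean.

    The input matrix is symmetric, so column means equal row means.
    """
    n = len(d)
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    return [
        [d[k][l] - row_means[k] - row_means[l] + grand_mean for l in range(n)]
        for k in range(n)
    ]


def dcov2_vstat(x, y):
    """Squared empirical distance covariance (V-statistic version, assumed)."""
    n = len(x)
    a = double_centre([[discrete_metric(u, v) for v in x] for u in x])
    b = double_centre([[discrete_metric(u, v) for v in y] for u in y])
    return sum(a[k][l] * b[k][l] for k in range(n) for l in range(n)) / n**2


def dcov2_from_table(x, y):
    """Same quantity from the contingency table: sum of (p_ij - q_i * r_j)^2."""
    n = len(x)
    pairs = list(zip(x, y))
    total = 0.0
    for i, j in product(sorted(set(x)), sorted(set(y))):
        p_ij = pairs.count((i, j)) / n   # observed joint proportion
        q_i = x.count(i) / n             # observed marginal proportion of X
        r_j = y.count(j) / n             # observed marginal proportion of Y
        total += (p_ij - q_i * r_j) ** 2
    return total


# Toy data (made up for illustration): two perfectly associated binary variables.
x = [1, 1, 2, 2]
y = [1, 1, 2, 2]
print(dcov2_vstat(x, y), dcov2_from_table(x, y))
```

Both functions take plain Python lists of category labels. The pairwise version costs $O(n^2)$ operations, whereas the table version only needs the cell counts, which is why a contingency-table form is convenient in practice.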

Figures (2)

Figure 1: Nominal significance level ($\alpha$) versus empirical power under the null hypothesis ($\hat{\alpha}$), for the decaying marginals model, comparing our distance covariance method (golden points), Pearson's chi-squared test (pale blue), Pearson's test with permutations (dark red), the USP (black), Fisher's exact test (green) and the $G$-test (purple). The grey shadow is a 95 % confidence band for $\hat{\alpha}$ given $\alpha$.
Figure 2: Power curve comparison for the decaying marginals model, comparing our distance covariance method (golden curve), Pearson's chi-squared test (pale blue), Pearson's test with permutations (dark red), the USP (black), Fisher's exact test (green) and the $G$-test (purple). The $5\times 8$ cells of each contingency table were filled with $n=100$ observations. $M=10^4$ replicates were considered. Error bars span from $-3$ to $+3$ standard deviations for each value of parameter $\varepsilon$, which indicates the distance from the null hypothesis.

Theorems & Definitions (2)

Theorem 1
Theorem 2
