A generalization of a U-statistics-based MCAR Test: Utilizing Partially Observed Variables
Danijel Aleksić
TL;DR
This work generalizes a U-statistics-based MCAR test to exploit partially observed variables, extending the original $A_n$ framework by incorporating covariances between incomplete data and response indicators through new statistics $T_{n,X}^{(u,v)}$, $T_{n,Y}^{(u,v)}$, and $\hat{T}_{n,Y}^{(u,v)}$. The expanded test statistic $A_n'$ combines these components with an estimated covariance $\hat{\Lambda}$ to achieve an asymptotic $\chi^2_{pq+q(q-1)}$ distribution under MCAR, enabling detection of a broader class of alternatives. Extensive simulations show better calibration than Little's MCAR test and greater robustness to violations of the finite fourth moment assumption, with improved power relative to Little's test in most practical settings and especially in scenarios where the old test misses alternatives. The approach remains scalable to higher dimensions and avoids strict limitations of prior implementations, though MNAR-alone scenarios remain challenging. Overall, the method provides a more flexible and powerful tool for MCAR assessment in datasets with partially observed variables, offering practical benefits for complete-case analyses and missing-data inference.
Abstract
In this paper, a generalized version of a U-statistics-based test for MCAR developed by Aleksić (2024) is presented. The novel test, like the original, tests for MCAR by calculating and combining the covariances between the response indicators and the data variables. However, unlike the old test, it is able to utilize partially observed variables, resulting in a significantly larger class of detectable alternatives. The novel test appears to be well calibrated, much better so than Little's MCAR test, which was used as a benchmark. For the alternatives that were detectable by the old test, the novel test has power comparable to, although slightly lower than, that of the old one, but it still outperforms Little's test in all of the studied scenarios. For alternatives that were previously undetectable or barely detectable, the novel test performs the best of the three. The novel test assumes finite fourth moments of the data, the same assumption that is necessary for Little's test. The results indicate that the novel test is more robust to violations of this assumption, although both tests have similar limitations.
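The basic ingredient described above, covariances between response indicators and data variables, can be illustrated with a minimal sketch. It is not the paper's statistic: the function name `indicator_covariances`, the NaN encoding of missing values, and the use of the unbiased sample covariance on available cases are all assumptions made for illustration. Under MCAR these covariances should be close to zero, and the test combines suitably normalized versions of them into a single statistic.

```python
import numpy as np

def indicator_covariances(data):
    """Covariances between response indicators and data variables.

    Illustrative only (hypothetical helper, not the paper's A_n' statistic).
    data: 2-D array (n rows, d columns), with np.nan marking missing values.
    Returns a d x d array C where C[j, k] is the unbiased sample covariance
    between the response indicator of column j and the values of column k,
    computed over the rows where column k is observed.
    Entries that are undefined are left as nan.
    """
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    observed = ~np.isnan(data)            # response indicators, True = observed
    cov = np.full((d, d), np.nan)
    for k in range(d):
        rows = observed[:, k]             # rows where variable k is available
        m = int(rows.sum())
        if m < 2:
            continue                      # covariance undefined with < 2 rows
        x = data[rows, k]
        for j in range(d):
            if j == k:
                continue                  # indicator of k is constant on these rows
            r = observed[rows, j].astype(float)
            cov[j, k] = np.sum((r - r.mean()) * (x - x.mean())) / (m - 1)
    return cov

# Toy data: column 0 fully observed, column 1 missing whenever column 0 is even.
toy = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [3.0, 30.0],
                [4.0, np.nan]])
C = indicator_covariances(toy)
```

Here `C[1, 0]` relates the missingness of the second variable to the values of the fully observed first variable, which is the kind of dependence the original test targets. `C[0, 1]` uses only the rows where the partially observed second variable is available, which corresponds to the extra information the generalized test is designed to exploit.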
