Two-sample Testing with Block-wise Missingness in Multi-source Data

Kejian Zhang, Muxuan Liang, Robert Maile, Doudou Zhou

TL;DR

This work tackles block-wise missingness in multi-source, multi-modal data for two-sample testing by introducing the Block-Pattern Enhanced Test (BPET), a general framework that partitions data by missingness patterns, applies pattern-aware statistics, and aggregates them into a global test without imputation or case deletion. It instantiates BPET with the Block-wise Rank In Similarity graph Edge-count (BRISE) test, which uses graph-induced ranks (RISE) to accommodate heterogeneous modalities. The authors establish finite-sample and asymptotic properties, including validity of pattern-wise permutation under missing-not-at-random (MNAR) mechanisms, a chi-square null distribution, and consistency in both the standard asymptotic regime and the high-dimensional low-sample-size (HDLSS) setting. Simulations and two real-world datasets (sepsis and Alzheimer's disease) show type-I error control and strong power. The framework supports valid inference in incomplete, multi-source settings, with practical implications for biomedical research and beyond.
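The partitioning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes missingness is encoded as a boolean sample-by-block indicator matrix, and the helper name `partition_by_pattern` is hypothetical.

```python
import numpy as np

def partition_by_pattern(observed):
    """Group row indices by which blocks (sources/modalities) are observed.

    observed: boolean array of shape (n_samples, n_blocks); True = block present.
    Returns a dict mapping each pattern (tuple of bools) to an array of row indices.
    NOTE: hypothetical helper for illustration; the encoding is an assumption.
    """
    observed = np.asarray(observed, dtype=bool)
    patterns = {}
    for i, row in enumerate(observed):
        # Use the tuple of observed-block flags as the pattern key.
        patterns.setdefault(tuple(bool(v) for v in row), []).append(i)
    return {p: np.array(idx) for p, idx in patterns.items()}

# Toy example: 6 samples, 3 sources.
observed = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
], dtype=bool)
groups = partition_by_pattern(observed)
```

Each entry of `groups` collects the samples sharing one missingness pattern, so a pattern-aware statistic can be computed on the blocks those samples have in common.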

Abstract

Multi-source and multi-modal datasets are increasingly common in scientific research, yet they often exhibit block-wise missingness, where entire modalities are systematically absent in some sources or no single source contains all modalities. This structured missingness poses major challenges for two-sample hypothesis testing. Standard approaches, such as imputation or complete-case analysis, may introduce bias or suffer efficiency loss, especially under missing-not-at-random mechanisms. To address this challenge, we propose the Block-Pattern Enhanced Test, a general framework for constructing two-sample test statistics that explicitly accounts for block-wise missingness. We show that the framework yields valid tests under a new condition that allows for missing-not-at-random mechanisms. Building on this general framework, we further propose the Block-wise Rank In Similarity graph Edge-count (BRISE) test, which accommodates heterogeneous modalities using rank-based similarity graphs. Theoretically, we establish that the null distribution of BRISE converges to a $\chi^2$ distribution, and that the test is consistent both in the standard asymptotic regime and in the high-dimensional low-sample-size setting under mild conditions. Simulation studies demonstrate that BRISE controls the type-I error rate and achieves strong power across a wide range of alternatives. Applications to two real-world datasets with block-wise missingness further illustrate the practical utility of the proposed method.

Paper Structure

This paper contains 19 sections, 6 theorems, 37 equations, 5 figures, 5 tables.

Key Result

Theorem 1

Suppose that every pattern $s$ has positive probability under both groups, but there exists at least one $s$ such that $P_{S\mid G=X}(s) \neq P_{S\mid G=Y}(s)$. Then under $H_0$, the distributions of $T$ and $\widetilde{T}$ differ when $\{\widetilde{G}_i\}$ are generated by the standard permutation scheme. Consequently, $p$-values based on the standard permutation distr
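To make the contrast with the standard permutation scheme concrete, here is a minimal sketch of permuting group labels within each missingness pattern, so that the per-pattern group counts are preserved. The function names and the absolute mean-difference statistic are illustrative assumptions, not the BRISE statistic or the authors' code.

```python
import numpy as np

def patternwise_permute(labels, pattern_ids, rng):
    """Shuffle group labels separately within each missingness pattern."""
    labels = np.asarray(labels)
    pattern_ids = np.asarray(pattern_ids)
    permuted = labels.copy()
    for s in np.unique(pattern_ids):
        idx = np.flatnonzero(pattern_ids == s)
        permuted[idx] = rng.permutation(labels[idx])
    return permuted

def permutation_pvalue(stat_fn, data, labels, pattern_ids, n_perm=999, seed=0):
    """Permutation p-value using pattern-wise label shuffles (illustrative)."""
    rng = np.random.default_rng(seed)
    observed_stat = stat_fn(data, labels)
    exceed = 0
    for _ in range(n_perm):
        perm_labels = patternwise_permute(labels, pattern_ids, rng)
        exceed += int(stat_fn(data, perm_labels) >= observed_stat)
    # Add-one correction keeps the p-value strictly positive.
    return (1 + exceed) / (1 + n_perm)

# Toy data: group 0 (X) is over-represented in pattern 0, group 1 (Y) in pattern 1.
labels = np.array([0, 0, 0, 1, 0, 1, 1, 1])
pattern_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
data = np.arange(8.0)

# Stand-in statistic (assumption): absolute difference in group means.
def mean_diff(x, g):
    return abs(x[g == 0].mean() - x[g == 1].mean())

permuted = patternwise_permute(labels, pattern_ids, np.random.default_rng(1))
pval = permutation_pvalue(mean_diff, data, labels, pattern_ids, n_perm=199)
```

Because labels are shuffled only among samples sharing a pattern, the number of group-$X$ and group-$Y$ samples in each pattern stays fixed across permutations, which a standard full shuffle does not guarantee.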

Figures (5)

  • Figure 1: Pattern partition across three sources, with gray blocks indicating missing data. Seven nonempty patterns are shown: Pattern 1 is fully observed; Patterns 2–4 each have one missing source; Patterns 5–7 each have two missing sources.
  • Figure 2: Three common strategies for handling missing data (gray entries).
  • Figure 3: Example of $k$th-NNGs for $(\mathcal{Z}^{(\alpha)},\mathcal{Z}^{(\beta)})$ on two cases: Left panel shows the case when $\alpha = \beta$; Right panel shows the case when $\alpha \neq \beta$, inducing a bipartite graph. Colors of arrows indicate neighbor level, while colors of nodes indicate observations from different patterns.
  • Figure 4: Trend of estimated powers vs. $p$ at significance level $\theta = 0.05$, for $m = n = 100$, $L = 2$, $p_X = p_Y = p$ and $d\in\{200,500,1000\}$. Top: Setting I-a; Middle: Setting I-b; Bottom: Setting I-c.
  • Figure 5: Heatmap of sepsis biomarker data. Gray indicates missing values.

Theorems & Definitions (10)

  • Definition 1: Permutation distributions
  • Theorem 1: Failure of standard permutation under unequal missing pattern distributions
  • Theorem 2: Validity of pattern-wise permutation
  • Definition 2: $k$-NNG-induced Rank
  • Remark 1
  • Remark 2
  • Theorem 3
  • Theorem 4: Limiting distribution under the null hypothesis
  • Theorem 5: Consistency
  • Theorem 6: Consistency under HDLSS