A Conditional Distribution Equality Testing Framework using Deep Generative Learning
Siming Zheng, Tong Wang, Meifang Lan, Yuanyuan Lin
TL;DR
This work tackles the problem of testing the equality of two conditional distributions, $\mathbb{P}_{1,Y|X}$ and $\mathbb{P}_{2,Y|X}$, a question that arises in covariate-shift and causal-invariance settings. It introduces a general conditional-generative framework that uses data splitting to transform the conditional test into an unconditional two-sample test, which allows flexible integration with neural-network-based generators. As a concrete instantiation, the authors develop the Generative Classification-Accuracy-Based Conditional Distribution Equality Test (GCA-CDET), which uses mixture density networks (MDNs) to learn $\mathbb{P}_{1,Y|X}$ and a classification-based test to decide equality. They prove a convergence rate for the MDN generator and the testing consistency of GCA-CDET, and report strong empirical performance on synthetic data and two real datasets (Wine Quality and HIV-1 Drug Resistance). The framework is designed to handle high-dimensional covariates and imbalanced samples, and it can accommodate other state-of-the-art conditional generative models.
Abstract
In this paper, we propose a general framework for testing the equality of conditional distributions in a two-sample problem, a task most relevant to covariate shift and causal discovery. Our framework builds on neural network-based generative methods and sample splitting techniques, and transforms the conditional testing problem into an unconditional one. We introduce the generative classification accuracy-based conditional distribution equality test (GCA-CDET) to illustrate the proposed framework. We establish the convergence rate of the learned generator by deriving new results related to the recently developed offset Rademacher complexity, and we prove the testing consistency of GCA-CDET under mild conditions. Empirically, we conduct numerical studies on synthetic datasets and two real-world datasets, demonstrating the effectiveness of our approach. Additional discussions on the optimality of the proposed framework are provided in the online supplementary material.
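
To make the reduction from a conditional test to an unconditional one concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear-Gaussian model as a stand-in for the mixture density network, a k-nearest-neighbour classifier as the classifier, and a one-sided normal approximation for the accuracy test; the function names (`fit_linear_gaussian`, `knn_predict`, `conditional_equality_test`) and the exact splitting scheme are hypothetical choices made for illustration. The idea it shows: learn a generator for $\mathbb{P}_{1,Y|X}$ on sample 1, use it to draw synthetic responses at covariates from sample 2, and then check whether a classifier can tell real pairs from synthetic pairs better than chance.

```python
import numpy as np
from math import erf, sqrt

def fit_linear_gaussian(x, y):
    """Assumed stand-in for the MDN: models Y | X as N(a + b'X, sigma^2)."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    sigma = (y - design @ coef).std(ddof=design.shape[1])

    def sample(x_new, rng):
        d = np.column_stack([np.ones(len(x_new)), x_new])
        return d @ coef + sigma * rng.standard_normal(len(x_new))

    return sample

def knn_predict(train_z, train_lab, test_z, k=15):
    """Plain k-NN majority vote (k odd, so no ties)."""
    d2 = ((test_z[:, None, :] - train_z[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (train_lab[nearest].mean(axis=1) > 0.5).astype(int)

def conditional_equality_test(x1, y1, x2, y2, rng, k=15):
    """Return (held-out accuracy, one-sided p-value) for H0: P1(Y|X) = P2(Y|X)."""
    sample = fit_linear_gaussian(x1, y1)       # generator learned on sample 1
    half = len(x2) // 2
    perm = rng.permutation(len(x2))
    real_idx, fake_idx = perm[:half], perm[half:2 * half]
    # Real pairs follow P2(X) x P2(Y|X); synthetic pairs follow P2(X) x P1(Y|X).
    real = np.column_stack([x2[real_idx], y2[real_idx]])
    fake = np.column_stack([x2[fake_idx], sample(x2[fake_idx], rng)])
    z = np.vstack([real, fake])
    lab = np.concatenate([np.ones(half, dtype=int), np.zeros(half, dtype=int)])
    order = rng.permutation(2 * half)
    z, lab = z[order], lab[order]
    n_train = half                              # train on half, test on the rest
    pred = knn_predict(z[:n_train], lab[:n_train], z[n_train:], k)
    acc = float((pred == lab[n_train:]).mean())
    # Under H0 the classifier cannot beat chance, so accuracy is near 1/2.
    z_stat = (acc - 0.5) / sqrt(0.25 / len(pred))
    p_value = 0.5 * (1.0 - erf(z_stat / sqrt(2.0)))
    return acc, p_value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=(600, 1))
    y1 = 2 * x1[:, 0] + 0.5 * rng.normal(size=600)
    x2 = rng.normal(0.5, 1.0, size=(600, 1))    # covariate shift in X
    y2 = -2 * x2[:, 0] + 0.5 * rng.normal(size=600)  # different conditional law
    print(conditional_equality_test(x1, y1, x2, y2, rng))
```

In this sketch the two samples may have different covariate distributions, since the comparison is made only at covariates drawn from sample 2. Any conditional generator with a `sample`-style interface could replace the linear-Gaussian stand-in.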
