Mining Invariance from Nonlinear Multi-Environment Data: Binary Classification
Austin Goddard, Kang Du, Yu Xiang
TL;DR
This work addresses binary classification in multi-environment settings where the data-generating process can change across environments, including through interventions on the target $Y$. It introduces the Binary Invariant Matching Property (bIMP), which identifies invariant representations by exploiting an invariant conditional expectation $\mathop{\mathrm{E}}_{\mathcal{P}_e}[X_k|X_S,Y]$ and admits a causal interpretation based on structural causal models (SCMs). A residual distribution test identifies valid $(k,S)$ pairs, which are then combined to predict $Y$ in unseen environments. The resulting procedure trains two sub-models per accepted pair and aggregates predictions across pairs, with variants that use linear models or generalized additive models (GAMs) to capture nonlinearities. Empirical results on synthetic and real data show that bIMP generalizes robustly to unseen environments and often outperforms standard baselines such as logistic regression and invariant causal prediction, highlighting its potential for causal domain adaptation in nonlinear, mixed-type data. Overall, the paper advances invariant learning for binary outcomes by combining a causal perspective with a scalable testing-and-aggregation strategy for environment generalization.
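The test-then-predict idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: a single candidate pair with scalar $X_S$, a hypothetical simulator in which $X_k = 2X_S + 3Y + \text{noise}$ stays fixed while the distributions of $X_S$ and $Y$ shift, one least-squares sub-model per class, a two-sample Kolmogorov-Smirnov comparison of residuals as the residual distribution test, and a nearest-residual rule for prediction. Aggregation across several accepted pairs is omitted.

```python
import random
from itertools import combinations
from statistics import mean

def simulate(rng, n, x_mean, p_y):
    """Hypothetical environment: X_S and P(Y=1) shift, the X_k mechanism does not."""
    rows = []
    for _ in range(n):
        y = 1 if rng.random() < p_y else 0
        x_s = rng.gauss(x_mean, 1.0)
        x_k = 2.0 * x_s + 3.0 * y + rng.gauss(0.0, 0.5)
        rows.append((x_s, x_k, y))
    return rows

def fit_line(xs, ys):
    """Ordinary least squares for y ~ a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def fit_submodels(rows):
    """Two sub-models: one regression of X_k on X_S for each value of Y."""
    models = {}
    for label in (0, 1):
        sub = [(x_s, x_k) for x_s, x_k, y in rows if y == label]
        models[label] = fit_line([s for s, _ in sub], [k for _, k in sub])
    return models

def residuals(rows, models):
    """Residuals of X_k around the sub-model selected by the observed Y."""
    return [x_k - (models[y][0] + models[y][1] * x_s) for x_s, x_k, y in rows]

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def pair_accepted(envs, models, c_alpha=1.358):
    """Assumed test: accept if residuals look alike in every pair of environments.

    c_alpha = 1.358 is the asymptotic KS critical constant for level 0.05;
    applying it to fitted residuals is only approximate.
    """
    for e1, e2 in combinations(envs, 2):
        r1, r2 = residuals(e1, models), residuals(e2, models)
        n, m = len(r1), len(r2)
        if ks_statistic(r1, r2) > c_alpha * ((n + m) / (n * m)) ** 0.5:
            return False
    return True

def predict(x_s, x_k, models):
    """Assumed rule: pick the label whose sub-model explains X_k best."""
    errs = {y: abs(x_k - (a + b * x_s)) for y, (a, b) in models.items()}
    return min(errs, key=errs.get)

rng = random.Random(0)
train_envs = [simulate(rng, 200, 0.0, 0.3), simulate(rng, 200, 2.0, 0.6)]
models = fit_submodels([row for env in train_envs for row in env])
accepted = pair_accepted(train_envs, models)

# Unseen environment with a different X_S distribution and class balance.
test_env = simulate(rng, 500, 5.0, 0.8)
hits = sum(predict(x_s, x_k, models) == y for x_s, x_k, y in test_env)
accuracy = hits / len(test_env)
print(f"pair accepted: {accepted}, accuracy in unseen environment: {accuracy:.3f}")
```

In this toy setup the class balance and the distribution of $X_S$ differ across environments, so a classifier fitted to the pooled label frequencies would be miscalibrated in the unseen environment, whereas the two sub-models depend only on the mechanism that stays fixed.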
Abstract
Making predictions in an unseen environment given data from multiple training environments is a challenging task. We approach this problem from an invariance perspective, focusing on binary classification to shed light on general nonlinear data generation mechanisms. We identify a unique form of invariance that exists solely in the binary setting and that allows us to train models that are invariant across environments. We provide sufficient conditions for such invariance and show that it is robust even when environmental conditions vary greatly. Our formulation admits a causal interpretation, allowing us to compare it with various frameworks. Finally, we propose a heuristic prediction method and conduct experiments using real and synthetic datasets.
