Beyond Pooling: Matching for Robust Generalization under Data Heterogeneity
Ayush Roy, Rudrasis Chakraborty, Lav Varshney, Vishnu Suresh Lokhande
TL;DR
The paper tackles the problem that pooling heterogeneous domain data can amplify distributional asymmetries and hurt zero-shot generalization. It introduces a causal, centroid-based matching framework that adaptively includes domains within a radius $\tau$ of a running centroid, yielding double robustness under ignorability/positivity and converging to the target distribution with reduced inter-domain variance. The authors prove asymptotic and finite-sample guarantees, extend the theory to non-Gaussian and multimodal data via normalizing flows and mode-separation arguments, and validate the approach on zero-shot medical anomaly detection, achieving stable, non-deteriorative performance as new domains are added. They also propose Variance-Aware Channel Attention (VACA) and geodesic matching on the hypersphere to tackle modality-driven clustering, achieving strong empirical gains and practical robustness in highly heterogeneous, multimodal settings.
Abstract
Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero-shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. Double robustness and propensity score matching for the inclusion of data domains make matching more robust than naive pooling and uniform subsampling by filtering out confounding domains (the main cause of heterogeneity). Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta-distributions; the analysis also extends to non-Gaussian and multimodal real-world settings. Most importantly, we show that these improvements translate to zero-shot medical anomaly detection, one of the most extreme forms of data heterogeneity and asymmetry. The code is available at https://github.com/AyushRoy2001/Beyond-Pooling.
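The idea of selecting domains relative to an adaptive centroid can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `match_domains`, the use of per-domain mean embeddings, the Euclidean distance, the greedy single pass, and the seed centroid are all assumptions made for illustration, and the propensity-score and double-robustness components are omitted entirely.

```python
import numpy as np

def match_domains(domain_means, init_centroid, tau):
    """Hypothetical sketch: admit a domain when its mean embedding lies
    within distance tau of the running centroid, then update the centroid."""
    centroid = np.asarray(init_centroid, dtype=float)
    included = []
    for idx, mean in enumerate(domain_means):
        mean = np.asarray(mean, dtype=float)
        # Assumption: Euclidean distance between domain mean and centroid.
        if np.linalg.norm(mean - centroid) <= tau:
            included.append(idx)
            # Incremental average over the seed plus all admitted domain means.
            centroid = centroid + (mean - centroid) / (len(included) + 1)
    return included, centroid

# Toy usage with made-up 2-D domain means; the third domain is far away.
toy_means = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [0.5, 0.5]]
included, centroid = match_domains(toy_means, init_centroid=[0.0, 0.0], tau=2.0)
```

In this toy setup the distant domain is left out of the pool, so it never shifts the centroid, whereas naive pooling would average it in with the rest.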
