Beyond Pooling: Matching for Robust Generalization under Data Heterogeneity

Ayush Roy, Rudrasis Chakraborty, Lav Varshney, Vishnu Suresh Lokhande

TL;DR

The paper tackles the problem that pooling heterogeneous domain data can amplify distributional asymmetries and hurt zero-shot generalization. It introduces a causal, centroid-based matching framework that adaptively includes domains within a radius $\tau$ of a running centroid, yielding double robustness under ignorability/positivity and converging to the target distribution with reduced inter-domain variance. The authors prove asymptotic and finite-sample guarantees, extend the theory to non-Gaussian and multimodal data via normalizing flows and mode-separation arguments, and validate the approach on zero-shot medical anomaly detection, achieving stable, non-deteriorative performance as new domains are added. They also propose Variance-Aware Channel Attention (VACA) and geodesic matching on the hypersphere to tackle modality-driven clustering, achieving strong empirical gains and practical robustness in highly heterogeneous, multimodal settings.

Abstract

Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero-shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. The double robustness and the propensity score matching for the inclusion of data domains make matching more robust than naive pooling and uniform subsampling by filtering out the confounding domains (the main cause of heterogeneity). Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta-distributions, which are also extended to non-Gaussian and multimodal real-world settings. Most importantly, we show that these improvements translate to zero-shot medical anomaly detection, one of the extreme forms of data heterogeneity and asymmetry. The code is available on https://github.com/AyushRoy2001/Beyond-Pooling.
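
A minimal sketch of the adaptive-centroid idea described above, assuming each domain is summarized by a mean embedding and that domains arrive sequentially. The function name `match_domains`, the seeding of the centroid with the first domain, and the plain Euclidean metric are illustrative assumptions, not the authors' released implementation:

```python
import numpy as np

def match_domains(domain_means, tau):
    """Greedy matching of domains against a running centroid.

    domain_means: (K, d) array, one mean embedding per domain, in arrival order.
    tau: inclusion radius (hypothetical parameter name).
    Returns the indices of the included domains and the final centroid.
    """
    domain_means = np.asarray(domain_means, dtype=float)
    # Assumption: the first domain seeds the centroid.
    included = [0]
    centroid = domain_means[0].copy()
    for k in range(1, len(domain_means)):
        # Include a domain only if it lies within radius tau of the centroid.
        if np.linalg.norm(domain_means[k] - centroid) <= tau:
            included.append(k)
            # Running centroid: mean of all domain means included so far.
            centroid = domain_means[included].mean(axis=0)
    return included, centroid

# Toy usage with hypothetical numbers: the third domain is far from the rest.
means = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [-0.2, 0.1]]
included, centroid = match_domains(means, tau=1.0)
print(included)
print(centroid)
```

In this toy run the distant domain at `[5.0, 5.0]` is skipped, so it never pulls the centroid away from the other three.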

Paper Structure

This paper contains 38 sections, 11 theorems, 11 equations, 8 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $\{\boldsymbol{\mu}_k\}_{k=1}^K \overset{i.i.d.}{\sim} \mathscr{D}_{\mu}$ with $\mathbb{E}[\boldsymbol{\mu}_k] = \boldsymbol{\mu}_*$ and $\mathrm{Cov}(\boldsymbol{\mu}_k) = \Sigma_{\mu}$. Suppose each domain $Q_k$ generates i.i.d. samples $\mathbf{x}_{k,i} \sim \mathcal{N}(\boldsymbol{\mu}_k, \s

Figures (8)

  • Figure 1: Comparison of the proposed method with MVFA [huang2024adapting], AnomalyCLIP [zhou2023anomalyclip], and BiLORA [zhu2025bilora]. Performance is reported in terms of domain alignment (DA), anomaly classification (AC), and anomaly segmentation (AS). The x-axis represents the sequential domain addition, and the y-axis represents the AUC scores. The figure also compares DA scores across all datasets for the Base, Agnostic, and GeoDVar methods. MVFA shows moderate alignment, with DA scores clustered around 2 (2.2365 for HIS, 2.1146 for Chest-XRay, 2.0501 for OCT17, 2.0466 for Brain MRI AC, and 2.0335 for Liver CT AC). AnomalyCLIP yields inconsistent and often poorer alignment, with scores ranging from 1.0102 to 3.2632 across tasks. BiLORA achieves better DA scores than both MVFA and AnomalyCLIP (HIS AC: 4.0; ChestXray AC: 3.145; OCT17 AC: 4.0; BrainMRI AC: 4.0; BrainMRI AS: 4.012; LiverCT AC: 3.158; LiverCT AS: 4.030; RESC AC: 3.081; RESC AS: 4.072). In contrast, our method achieves consistently superior domain alignment, with DA scores at or above 4.0 for all datasets and tasks, peaking at 4.0736 for Brain MRI detection (AC) and 4.0484 for Brain MRI segmentation (AS), demonstrating its robust performance in matching complex domain distributions.
  • Figure 2: Ablation comparing different matching metrics $M$: Euclidean distance (CS_L2), cosine similarity (CS_Cosine), geodesic distance (CS_Geodesic), and geodesic distance with VACA (CS_GeodVar). Performance is reported in terms of domain alignment (DA), anomaly classification (AC), and anomaly segmentation (AS). The x-axis represents the sequential domain addition, and the y-axis represents the AUC scores. CS_L2 and CS_Cosine achieve a DA of 4 for BrainMRI, LiverCT, and ChestXRay. CS_Geodesic achieves a DA of 4 for ChestXRay and LiverCT, 4.0138 for BrainMRI detection (AC), and 4.0081 for BrainMRI segmentation (AS). CS_GeodVar achieves a DA of 4.0134 for ChestXRay, 4.0163 for LiverCT detection (AC), 4.0010 for LiverCT segmentation (AS), 4.0736 for BrainMRI detection (AC), and 4.0484 for BrainMRI segmentation (AS).
  • Figure 3: Illustration of modality-induced clustering in CLIP feature space. Embeddings of distinct modalities form disjoint angular regions on the hypersphere (see the dataset-similarity table in the supplementary section on modality clusters for more details).
  • Figure 4: Asymmetric setting: domain means (colored dots) concentrate along a biased direction, away from the target mean $\boldsymbol{\mu}_*=\mathbf{0}$ (red star).
  • Figure 5: Asymptotic behavior (Theorem 1). Left: bias $\epsilon_K=\|\bar{\boldsymbol{\mu}}_K-\boldsymbol{\mu}_*\|_2$ versus number of domains $K\in\{5,10,20,30,40,50\}$. Right: final bias at $K=50$. Matching achieves the smallest bias as $K$ grows, in line with its convergence to $\mathcal{N}(\boldsymbol{\mu}_*,\sigma^2\mathbf{I}_d)$, while other strategies retain inter-domain variance.
  • ...and 3 more figures
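
For the geodesic matching metric compared in Figure 2, one standard choice on the unit hypersphere is the great-circle (arc-length) distance, i.e. the angle between normalized embeddings. The sketch below assumes that reading; the helper name `geodesic_distance` is hypothetical, and VACA is not modeled here:

```python
import numpy as np

def geodesic_distance(u, v, eps=1e-12):
    """Great-circle distance (in radians) between two embeddings.

    Both vectors are projected onto the unit hypersphere first, so only
    their direction matters. `eps` guards against division by zero.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    u = u / max(np.linalg.norm(u), eps)
    v = v / max(np.linalg.norm(v), eps)
    # Clip to [-1, 1] so rounding error cannot push arccos out of its domain.
    cos_sim = np.clip(np.dot(u, v), -1.0, 1.0)
    return float(np.arccos(cos_sim))

# Orthogonal directions are a quarter circle apart; rescaling changes nothing.
print(geodesic_distance([1.0, 0.0], [0.0, 1.0]))
print(geodesic_distance([1.0, 0.0], [2.0, 0.0]))
```

Because the value depends only on the angle between the two vectors, embeddings that fall in disjoint angular regions stay far apart under this metric regardless of their norms.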

Theorems & Definitions (24)

  • Definition 1: Naive Pooling
  • Definition 2: Uniform Subsampling
  • Definition 3: Matching
  • Theorem 1: Asymptotic Behavior as $K \to \infty$
  • Definition 4: Distributional Symmetry
  • Example 1
  • Theorem 2: Finite-$K$ Unbiasedness under Symmetry
  • Theorem 3: Finite-Sample Robustness under Domain Addition
  • Theorem 4: Normalizing Flow Transport Theorem with Lipschitz Guarantees
  • Definition 5: Multimodal Data Distribution
  • ...and 14 more
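
The definitions themselves are not reproduced in this summary, so the following toy comparison rests on assumed readings: naive pooling averages all domain means, while matching keeps only the domains within a radius `tau` of a centroid that is re-estimated until the selection stops changing. The median starting centroid, the helper names, and all numbers are hypothetical; the bias is computed as $\|\bar{\boldsymbol{\mu}}_K-\boldsymbol{\mu}_*\|_2$:

```python
import numpy as np

def bias(estimate, mu_star):
    """Euclidean distance between an estimated mean and the target mean."""
    return float(np.linalg.norm(np.asarray(estimate) - np.asarray(mu_star)))

def iterative_match(mus, tau, max_iter=20):
    """Assumed batch variant of matching: filter by radius, re-centre, repeat."""
    # Assumption: the coordinate-wise median serves as a robust starting centroid.
    centroid = np.median(mus, axis=0)
    mask = np.zeros(len(mus), dtype=bool)
    for _ in range(max_iter):
        new_mask = np.linalg.norm(mus - centroid, axis=1) <= tau
        # Stop when nothing is selected or the selection no longer changes.
        if not new_mask.any() or np.array_equal(new_mask, mask):
            break
        mask = new_mask
        centroid = mus[mask].mean(axis=0)
    return mask, centroid

# Toy asymmetric configuration: target mean at the origin, eight domain means
# on a small circle around it, and two domain means shifted along the x-axis.
mu_star = np.zeros(2)
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
near = 0.3 * np.column_stack([np.cos(angles), np.sin(angles)])
shifted = np.array([[4.0, 0.3], [4.0, -0.3]])
mus = np.vstack([near, shifted])

pooled_bias = bias(mus.mean(axis=0), mu_star)   # every domain contributes
mask, centroid = iterative_match(mus, tau=1.0)
matched_bias = bias(centroid, mu_star)          # shifted domains are filtered out

print(f"pooled bias  = {pooled_bias:.3f}")
print(f"matched bias = {matched_bias:.3f}")
```

In this configuration the pooled mean lands about 0.8 away from the target, while matching drops the two shifted domains and recovers the origin up to floating-point error. It illustrates the qualitative claim only and says nothing about the rates stated in the theorems.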