Mixture Data for Training Cannot Ensure Out-of-distribution Generalization

Songming Zhang; Yuxiao Luo; Qizhou Wang; Haoang Chi; Xiaofeng Chen; Bo Han; Jinyan Li

TL;DR

This work questions the universal benefit of more data for out-of-distribution (OOD) generalization by introducing a convex-hull view: OOD data are defined with respect to the convex hull $Con(\mathcal{E}_s)$ of the training environments, and new risk bounds distinguish data inside the hull from data outside it. The authors prove that while models can generalize well to unseen data within the hull, generalization to data outside the hull cannot be guaranteed, even as the training set grows, and they uncover diverse non-monotonic error patterns across benchmarks. They validate these insights with extensive experiments on MNIST, CIFAR-10, PACS, and DomainNet, showing that OOD error can decrease under some shifts but is non-monotonic or stays flat under larger shifts. To improve OOD performance, they propose a diversity-driven data selection framework (including an RL-guided sampler) that expands the training mixture's coverage without requiring target-domain labels, thereby broadening the learned representations and improving OOD generalization on large datasets. Overall, the paper provides both theoretical bounds and practical data-selection strategies that explain why simply adding more data does not necessarily improve OOD generalization and how to design more robust training mixtures.
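To make the convex-hull view concrete, the sketch below treats each training environment as a discrete distribution (a histogram) and measures how far a target distribution lies from the set of all mixtures of the training environments. It is an illustration under stated assumptions, not the paper's procedure: the paper works with the $\mathcal{H}$-divergence between environments, whereas this sketch uses Euclidean distance between histograms as a stand-in, and the function names `mix` and `distance_to_hull` are hypothetical.

```python
from typing import List, Sequence, Tuple


def mix(weights: Sequence[float], envs: Sequence[Sequence[float]]) -> List[float]:
    """Return the mixture sum_i weights[i] * envs[i], computed bin by bin."""
    dim = len(envs[0])
    return [sum(w * e[k] for w, e in zip(weights, envs)) for k in range(dim)]


def distance_to_hull(
    target: Sequence[float],
    envs: Sequence[Sequence[float]],
    iters: int = 500,
) -> Tuple[float, List[float]]:
    """Euclidean distance from `target` to the convex hull of `envs`.

    Uses Frank-Wolfe with exact line search over the mixture weights.
    Assumption: Euclidean distance is only a proxy for the divergence
    used in the paper.
    """
    n = len(envs)
    weights = [1.0 / n] * n  # start from the uniform mixture
    for _ in range(iters):
        point = mix(weights, envs)
        residual = [p - t for p, t in zip(point, target)]
        # The gradient w.r.t. weight i is proportional to <residual, envs[i]>;
        # move toward the environment with the smallest value.
        scores = [sum(r * x for r, x in zip(residual, e)) for e in envs]
        best = min(range(n), key=scores.__getitem__)
        direction = [x - p for x, p in zip(envs[best], point)]
        denom = sum(d * d for d in direction)
        if denom == 0.0:
            break  # already sitting on the best vertex
        step = -sum(r * d for r, d in zip(residual, direction)) / denom
        step = max(0.0, min(1.0, step))  # stay inside the hull
        if step == 0.0:
            break  # no descent direction left: current mixture is optimal
        weights = [(1.0 - step) * w for w in weights]
        weights[best] += step
    point = mix(weights, envs)
    dist = sum((p - t) ** 2 for p, t in zip(point, target)) ** 0.5
    return dist, weights


if __name__ == "__main__":
    # Two hypothetical training environments over three bins.
    envs = [[0.8, 0.1, 0.1], [0.2, 0.4, 0.4]]
    inside = [0.35, 0.325, 0.325]  # equals 0.25 * envs[0] + 0.75 * envs[1]
    outside = [0.1, 0.1, 0.8]      # not a mixture of the two environments
    print(distance_to_hull(inside, envs))
    print(distance_to_hull(outside, envs))
```

A distance near zero means the target can be written as a mixture of the training environments; a clearly positive distance marks it as lying outside the hull under this proxy.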

Abstract

Deep neural networks often struggle to generalize to out-of-distribution (OOD) data, and there remains a notable theoretical gap between the contributing factors and their respective impacts. Evidence from the literature on in-distribution data suggests that the generalization error shrinks as the size of the training mixture increases. For OOD samples, however, this conventional understanding no longer holds: increasing the size of the training data does not always reduce the test generalization error. In fact, diverse error trends have been found across shift scenarios, including decreases that follow a power law, initial declines followed by increases, and continuously stable patterns. Previous work has approached OOD data qualitatively, treating them merely as samples unseen during training, which makes it hard to explain these complicated non-monotonic trends. In this work, we quantitatively redefine OOD data as those situated outside the convex hull of the mixed training data and establish novel generalization error bounds to better understand these counterintuitive observations. Our proof of the new risk bound confirms that the efficacy of well-trained models can be guaranteed for unseen data within the convex hull; more interestingly, for OOD data beyond this coverage, generalization cannot be ensured, which aligns with our observations. Furthermore, we evaluated various OOD techniques to underscore that our results not only explain insightful observations in recent OOD generalization work, such as the significance of diverse data and the sensitivity of existing algorithms to unseen shifts, but also inspire a novel and effective data selection strategy.
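The three error trends named in the abstract (power-law decrease, decline followed by increase, and stable) can be told apart from a measured error curve with a few lines of code. The following is a minimal sketch rather than the authors' analysis: the helper names `fit_power_law` and `describe_trend`, the tolerance `tol`, and the label strings are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], errors: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of errors ~ a * sizes**(-b) in log-log space.

    Returns (a, b). A clearly positive b indicates a power-law decrease.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope


def describe_trend(errors: Sequence[float], tol: float = 0.01) -> str:
    """Label an error curve ordered by increasing training-set size.

    `tol` is a hypothetical threshold below which changes count as noise.
    """
    lowest = min(range(len(errors)), key=errors.__getitem__)
    drop = errors[0] - errors[lowest]      # improvement up to the minimum
    rebound = errors[-1] - errors[lowest]  # deterioration after the minimum
    if drop > tol and rebound > tol:
        return "decline then increase"
    if drop > tol:
        return "decreasing"
    if max(errors) - min(errors) <= tol:
        return "stable"
    return "increasing"


if __name__ == "__main__":
    sizes = [100, 200, 400, 800]
    # Synthetic curve that follows 4.0 * n**(-0.5) exactly.
    power_curve = [4.0 * n ** -0.5 for n in sizes]
    print(fit_power_law(sizes, power_curve))
    print(describe_trend([0.40, 0.30, 0.25, 0.33]))
```

Fitting in log-log space keeps the example dependency-free; on real curves the fit is only meaningful when `describe_trend` reports a decreasing trend.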

Paper Structure (18 sections, 2 theorems, 16 equations, 9 figures, 1 table, 1 algorithm)

Introduction
Generalization error of patterns observed from OOD scenarios
Experimental Settings for OOD Evaluation
Datasets containing OOD distributions
Details for Training
Neural Architectures
Generalization error scenarios for deep learning benchmark datasets
Revisit OOD generalization problem
Formulation of the OOD generalization problem
Redefinition for OOD data
Can we break OOD limitation to improve model's capability?
Observation for widely used OOD methods
Selection of the training samples
Diversity learning for OOD generalization
Framework Details
...and 3 more sections

Key Result

Lemma 3.1

Suppose $d_{\mathcal{H}}\left[e^i,e^j\right] \leq \epsilon, \forall i, j \in\left[N_{S}\right]$, then the following inequality holds for the $\mathcal{H}$-divergence between any pair of environments $e^{\prime}, e^{\prime \prime} \in Con(\mathcal{E}_{s})$:

Figures (9)

Figure 1: A schematic diagram of a multi-domain sample in practice, consisting of source and target domains. Suppose we only have access to Painting and Photo; the model then exhibits different generalization abilities on the OOD domains Cartoon and Sketch, depending on their distance to the mixture of training data. We draw the counterintuitive conclusion that the efficacy of well-trained models cannot be guaranteed for OOD data beyond the convex hull of the training mixture, which is consistent with our experimental observations in \ref{sec:obser}.
Figure 2: The lower the OOD generalization error, the better the model handles unseen targets. Error bars indicate $95\%$ confidence intervals (10 runs). (a) Unseen samples obtained by rotating images by different angles $\theta_1$ in OOD sub-task $T_2$ (Bird vs. Cat) in CIFAR-10, with $0^{\circ}$ and $60^{\circ}$ as training samples, $M=400$. For small $\theta_1$, increasing the training data size improves the OOD generalization ability of the model. However, beyond a certain value of $\Delta_1$, the error under large rotations shows a non-monotonic trend, which suggests overfitting with respect to the unseen rotation. (b) Gaussian blur levels from $2$ to $20$ are the unseen samples, and the training blur levels are 0 and 3, $M=400$. The model is resilient to unobserved blur, yet for extreme blur levels non-monotonic trends are evident, indicating that the model is misaligned with the data due to noise. (c) Generalization error of two separate networks, WRN-10-2 and SmallConv, on a given unseen task. Our plots cover 3 different task pairs from Split-CIFAR10 and show the generalization error as a function of the number of training samples. All 3 pairs show a non-decreasing trend in OOD generalization error for both network models. (d) Generalization error on the two datasets that make up CINIC-10, namely CIFAR-10 and an ImageNet subset. We set one as the training environment and the other as OOD. While the purple curve shows higher error due to distribution shift, we did not observe any non-monotonic trend when testing on the unseen samples. Even when transferring between different datasets, the degree of distribution shift is still the main factor.
Figure 3: Different error trends in OOD generalization error on three DomainBed benchmarks. Left: Rotated MNIST (10 classes, $M=2,000$, SmallConv), Middle: PACS (4 classes, 4 domains {A, C, P, S}, $M=25$, WRN-16-4), Right: DomainNet (40 classes, 6 domains {paint, sketch, real, graph, clipart, draw}, $M=25$, WRN-16-4). Error bars indicate $95\%$ confidence intervals (10 runs for Rotated MNIST and PACS, 3 runs for DomainNet). As the number of training samples increases, the various distances between distributions and how they are combined lead to different decreasing trends in OOD generalization error.
Figure 4: For two benchmark datasets, we plot the OOD generalization error ($y$-axis) as a function of the OOD sample size per class $M$ ($x$-axis). Left: a classification task from Rotated CIFAR-10, where the OOD rotation is $\theta_1 = 30^{\circ}$ or $135^{\circ}$. Right: a classification task from DomainNet, with graph and clipart as the respective OOD environments. We compute the OOD generalization error over 10 runs and 3 seeds for the two datasets, respectively. We found a decrease at lower $M$ across all the pairs, and for larger values of $M$ the average error is stable with decreasing variance. Error bars indicate $95\%$ confidence intervals.
Figure 5: With $0^{\circ}$ and $60^{\circ}$ as source samples, and $135^{\circ}$ and $30^{\circ}$ as the respective OOD samples in Rotated CIFAR-10 sub-task $T_2$, we investigate the effect of hyper-parameter tuning. We select the best set of hyper-parameters on a validation set and test it on an unseen target. The same error trend as in our previous results can still be observed, since manipulating the training set is irrelevant for the test set and the distribution distance is the main influencing factor.
...and 4 more figures
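The captions above report error bars as $95\%$ confidence intervals over 10 runs (3 runs for DomainNet). The source does not say how these intervals were computed; the sketch below assumes a standard two-sided Student-t interval around the mean of the per-run errors. The function name `mean_ci95` and the sample values are hypothetical.

```python
import statistics
from typing import Sequence, Tuple

# Two-sided 95% Student-t critical values, keyed by the number of runs
# (degrees of freedom = runs - 1). Only the run counts mentioned in the
# captions are covered.
T_CRIT_95 = {3: 4.303, 10: 2.262}


def mean_ci95(errors: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half_width) of a 95% t-interval over per-run errors.

    Assumption: runs are independent and roughly normally distributed.
    """
    n = len(errors)
    if n not in T_CRIT_95:
        raise ValueError(f"no tabulated critical value for {n} runs")
    mean = statistics.mean(errors)
    standard_error = statistics.stdev(errors) / n ** 0.5
    return mean, T_CRIT_95[n] * standard_error


if __name__ == "__main__":
    # Hypothetical OOD errors from three seeds.
    mean, half_width = mean_ci95([0.2, 0.3, 0.4])
    print(f"{mean:.3f} +/- {half_width:.3f}")
```

With only 3 runs the critical value is large, so intervals for the DomainNet panels would be expected to be visibly wider than those computed from 10 runs at similar variance.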

Theorems & Definitions (10)

Lemma 3.1: Paraphrase from albuquerque2019generalizing
Definition 3.1: Out-of-distribution data (General)
Definition 3.2: Out-of-distribution data (Refined)
Remark 3.1
Remark 3.2: An intuitive explanation of OOD data definition
Theorem 3.1: Upper-bounding the risk on unseen data
proof
Remark 3.3: Intuitive interpretation of \ref{thm:error}
Remark 3.4: The importance of diverse data
Remark 3.5: Widely used OOD techniques
