Unraveling the Key Components of OOD Generalization via Diversification

Harold Benoit; Liangze Jiang; Andrei Atanov; Oğuzhan Fatih Kar; Mattia Rigotti; Amir Zamir

TL;DR

This work analyzes diversification methods for improving OOD generalization under spurious correlations, showing that success hinges on the distribution of unlabeled data, the learning algorithm's inductive biases, and their interaction. It formalizes a two-stage diversification framework and compares DivDis and D-BAT, revealing that diversification alone cannot guarantee OOD generalization and that the optimal setup depends on both unlabeled data and model architecture/pretraining. Through synthetic and real-data experiments, the authors demonstrate that there exist sweet spots for unlabeled-data distributions, that increasing the number of diverse hypotheses does not rectify misalignment, and that co-dependence between data and algorithm is essential for achieving robust generalization. The findings offer concrete guidance for practitioners and point to future directions for designing more resilient diversification methods that adapt to unlabeled-data regimes and model biases.

Abstract

Supervised learning datasets may contain multiple cues that explain the training set equally well, i.e., learning any of them would lead to correct predictions on the training data. However, many of them can be spurious, i.e., they lose their predictive power under a distribution shift and consequently fail to generalize to out-of-distribution (OOD) data. Recently developed "diversification" methods (Lee et al., 2023; Pagliardini et al., 2023) approach this problem by finding multiple diverse hypotheses that rely on different features. This paper studies this class of methods and identifies the key components contributing to their OOD generalization abilities. We show that (1) diversification methods are highly sensitive to the distribution of the unlabeled data used for diversification and can underperform significantly when away from a method-specific sweet spot. (2) Diversification alone is insufficient for OOD generalization. The choice of learning algorithm, e.g., the model's architecture and pretraining, is crucial. In standard experiments (classification on the Waterbirds and Office-Home datasets), using the second-best choice leads to an absolute drop in accuracy of up to 20%. (3) The optimal choice of learning algorithm depends on the unlabeled data and vice versa, i.e., they are co-dependent. (4) Finally, we show that, in practice, the above pitfalls cannot be alleviated by increasing the number of diverse hypotheses, the major feature of diversification methods. These findings provide a clearer understanding of the critical design factors influencing the OOD generalization abilities of diversification methods. They can guide practitioners in making the best use of existing methods and guide researchers in developing new, better ones.
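To make the idea of "diverse hypotheses" more concrete, the following is a minimal sketch of a diversification objective for a second hypothesis: it fits the labeled training data while being pushed to disagree with a frozen first hypothesis on unlabeled data. The disagreement term is written in the spirit of D-BAT for binary classification; the linear-logistic models, the toy data, the weight `alpha`, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diversification_objective(w2, w1, X_train, y_train, X_unlabeled,
                              alpha=1.0, eps=1e-8):
    """Cross-entropy of the second hypothesis on labeled data plus a
    penalty for agreeing with the frozen first hypothesis on unlabeled data.
    Both hypotheses are linear-logistic models (an illustrative assumption)."""
    # Fit term: standard binary cross-entropy on the labeled training set.
    p2_train = sigmoid(X_train @ w2)
    ce = -np.mean(y_train * np.log(p2_train + eps)
                  + (1 - y_train) * np.log(1 - p2_train + eps))

    # Diversity term: probability that the two hypotheses predict
    # different classes on each unlabeled point.
    p1_u = sigmoid(X_unlabeled @ w1)
    p2_u = sigmoid(X_unlabeled @ w2)
    p_disagree = p1_u * (1 - p2_u) + (1 - p1_u) * p2_u
    disagreement = -np.mean(np.log(p_disagree + eps))

    return ce + alpha * disagreement

# Toy data (hypothetical): feature 0 is the "true" cue, feature 1 is spurious.
# On the training set the two cues are perfectly correlated.
X_train = np.array([[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]])
y_train = np.array([1.0, 1.0, 0.0, 0.0])
# On the unlabeled set the two cues are inversely correlated.
X_unlabeled = np.array([[1.0, -1.0], [-1.0, 1.0]])

w1 = np.array([0.0, 5.0])      # first hypothesis: relies on the spurious cue
w_core = np.array([5.0, 0.0])  # candidate second hypothesis: relies on the true cue

loss_core = diversification_objective(w_core, w1, X_train, y_train, X_unlabeled)
loss_same = diversification_objective(w1, w1, X_train, y_train, X_unlabeled)
print(loss_core, loss_same)
```

In this toy setting both candidates fit the training data equally well, so the difference in the objective comes entirely from the unlabeled data: the candidate that uses the other cue disagrees with the first hypothesis and receives the lower loss. This also hints at why the distribution of the unlabeled data matters, since the diversity term can only separate cues that actually conflict on those points.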

Paper Structure (26 sections, 3 theorems, 8 equations, 15 figures, 11 tables)

Introduction
Related Work
Learning via Diversification
Problem Formulation
Diversification for OOD Generalization
The Relationship Between Unlabeled Data and OOD Generalization via Diversification
Theoretical and Empirical Study of a Synthetic Example
Verification on Real-World Image Data
The Relationship Between Learning Algorithm and OOD Generalization via Diversification
Diversification Alone Is Insufficient for OOD Generalization
Learning Algorithm Selection: A Key to Effective Diversification
On the Co-Dependence between Learning Algorithm and Unlabeled Data
Conclusion and Limitations
Proof and Discussion of Proposition 1
Results for Training MLPs on 2D Task
...and 11 more sections

Key Result

Proposition 1

(On Optimal Diversification Loss) In the synthetic 2D binary task, let $h_2^{\mathrm{DB}}$ and $h_2^{\mathrm{DD}}$ be the second hypotheses of D-BAT and DivDis-Seq, respectively. If $r_{D_u} = 0$, then $h_2^{\mathrm{DB}} = h^\star$ and $h_2^{\mathrm{DD}} = h(x;\frac{\pi}{4})$. Otherwise, if $r_{D_u} = 0.5$, then $h_2^{\mathrm{DB}} = h(x;\pi)$ ...

Figures (15)

Figure 1: Diversification is a two-legged problem where unlabeled data and learning algorithm both matter and are co-dependent. Markers denote the training data points and their labels, as well as the unlabeled data. $h_{\mathrm{ERM}}$ represents the hypothesis found by empirical risk minimization (ERM), thus reflecting the inductive bias of the learning algorithm. $h_2$ represents a second diverse hypothesis found by a diversification method; it has low error on the training data, as $h_{\mathrm{ERM}}$ does, but disagrees with it on the unlabeled data. Compared to (a) the original setting, we study how changing (b) the unlabeled data and (c) the learning algorithm yields different solutions and, therefore, different performance.
Figure 2: Performance of diversification is highly dependent on unlabeled OOD data. Left: Top-left quadrant: the 2D binary classification task. Other quadrants: the second hypotheses (arrows are normal vectors) found by D-BAT ($h_2^{\mathrm{DB}}$) and DivDis-Seq ($h_2^{\mathrm{DD}}$) with varied spurious ratios of unlabeled OOD data $r_{D_u} \in \{0, 0.25, 0.5\}$ (from inversely correlated to balanced). Right: Best hypothesis test accuracy of D-BAT and DivDis(-Seq) on MNIST-CIFAR (M/C) for varied spurious ratios $r_{D_u}$ and number of hypotheses $K$. The test accuracy is measured on held-out balanced data $D_{\mathrm{ood}}$ (i.e., $r_{D_{\mathrm{ood}}} = 0.5$, no spurious correlation).
Figure 3: The performance of diversification methods is highly sensitive to the choice of architecture and pretraining method. Left: DivDis and D-BAT best hypothesis ($K = 2$) performance with multiple pretraining strategy and architecture pairs on Waterbirds-CC (Left) and Office-Home (Right). ResNet50 is used if not specified. Right: Top-1 accuracy on ImageNet-1k after fine-tuning.
Figure 4: Synthetic 2D Binary Classification Task. Markers denote the training data points and their labels, as well as the unlabeled OOD data. In this setting, the unlabeled OOD data $D_u$ has spurious ratio $r_{D_u}=0$ (i.e., inversely correlated).
Figure 5: Performance of diversification is highly dependent on unlabeled OOD data (2D example + MLP). The unlabeled OOD data points are not shown in the plots. Left: the labeled training data $D_t$, the ground truth function $h^\star$ and the spurious function $h_{\mathrm{sp}}$. Right: the second hypothesis for D-BAT and DivDis-Seq under different spurious ratios of unlabeled OOD data.
...and 10 more figures
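The captions above describe unlabeled data through its spurious ratio $r_{D_u}$, ranging from $0$ (inversely correlated) to $0.5$ (balanced, no spurious correlation). The sketch below shows one way such a set could be generated, reading the ratio as the fraction of points on which the spurious cue agrees with the true label. That reading, the two-feature $\pm 1$ encoding, and the function names are assumptions for illustration; the paper's synthetic task may be defined differently.

```python
import numpy as np

def sample_unlabeled(n, spurious_ratio, seed=0):
    """Draw n two-feature points: column 0 encodes the true cue,
    column 1 encodes the spurious cue (both as -1/+1).
    spurious_ratio is the fraction of points where the two cues agree."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)  # hidden true labels in {0, 1}

    # Mark exactly round(spurious_ratio * n) points as "agreeing".
    n_agree = int(round(spurious_ratio * n))
    agree = np.zeros(n, dtype=bool)
    agree[:n_agree] = True
    rng.shuffle(agree)

    # The spurious cue copies the label where it agrees, flips it otherwise.
    s = np.where(agree, y, 1 - y)
    X = np.stack([2 * y - 1, 2 * s - 1], axis=1).astype(float)
    return X, y

def empirical_spurious_ratio(X, y):
    """Fraction of points whose spurious cue matches the true label."""
    return float(np.mean((X[:, 1] > 0) == (y == 1)))

for r in (0.0, 0.25, 0.5):
    X_u, y_u = sample_unlabeled(200, r)
    print(r, empirical_spurious_ratio(X_u, y_u))
```

The labels `y_u` are returned only so the ratio can be checked; a diversification method would see `X_u` alone.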

Theorems & Definitions (6)

Definition 1
Proposition 1
Proposition 2
proof
Proposition 3
proof
