Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization

Devansh Arpit, Huan Wang, Yingbo Zhou, Caiming Xiong

TL;DR

The paper tackles the instability of domain generalization (DG) under distribution shift by introducing a hyperparameter-free moving-average protocol (SMA) that stabilizes out-domain performance and improves the reliability of early stopping. Building on SMA, it proposes Ensemble of Averages (EoA), which ensembles moving-average models from independent runs to further boost DG, explained via a Bias-Variance adaptation where ensembles primarily reduce variance. Empirically, SMA and especially EoA achieve consistent DG gains on DomainBed across multiple backbones, with average improvements around $4$–$6\%$ over ERM and notable in-domain improvements as well; larger pre-training and model sizes yield larger gains. The work provides a practical, scalable approach with theoretical insight, plus public code, to improve cross-domain generalization in real-world settings.

Abstract

In Domain Generalization (DG) settings, models trained independently on a given set of training domains have notoriously chaotic performance on distribution-shifted test domains, and stochasticity in optimization (e.g. seed) plays a big role. This makes deep learning models unreliable in real-world settings. We first show that this chaotic behavior exists even along the training optimization trajectory of a single model, and propose a simple model averaging protocol that both significantly boosts domain generalization and diminishes the impact of stochasticity by improving the rank correlation between the in-domain validation accuracy and out-domain test accuracy, which is crucial for reliable early stopping. Taking advantage of our observation, we show that instead of ensembling unaveraged models (as is typical in practice), ensembling moving average models (EoA) from independent runs further boosts performance. We theoretically explain the boost in performance of ensembling and model averaging by adapting the well-known Bias-Variance trade-off to the domain generalization setting. On the DomainBed benchmark, when using a pre-trained ResNet-50, this ensemble of averages achieves an average of $68.0\%$, beating vanilla ERM (w/o averaging/ensembling) by $\sim 4\%$, and when using a pre-trained RegNetY-16GF, achieves an average of $76.6\%$, beating vanilla ERM by $6\%$. Our code is available at https://github.com/salesforce/ensemble-of-averages.
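As a rough illustration of the two ideas in the abstract, the sketch below keeps a running mean of a model's parameters from a start iteration `t0` onward, then combines the outputs of several such averaged models. It is not taken from the authors' repository: the function names `sma_update` and `ensemble_predict`, the flat-list parameter representation, and the choice to average per-class outputs before taking the argmax are all assumptions made for illustration.

```python
from typing import List, Sequence


def sma_update(avg: List[float], params: Sequence[float], t: int, t0: int) -> List[float]:
    """Return the running mean of the parameter vectors seen at iterations t0..t.

    Before averaging starts (t <= t0) the average simply tracks the current
    parameters. `avg` is the value returned at iteration t - 1.
    """
    if t <= t0:
        return list(params)
    n = t - t0  # number of iterates already folded into `avg` (t0 .. t-1)
    return [(a * n + p) / (n + 1) for a, p in zip(avg, params)]


def ensemble_predict(outputs_per_model: Sequence[Sequence[float]]) -> int:
    """Average per-class outputs across models and return the argmax class.

    Averaging raw outputs is an assumption of this sketch; the paper may
    combine model outputs differently.
    """
    num_models = len(outputs_per_model)
    num_classes = len(outputs_per_model[0])
    mean = [
        sum(model[k] for model in outputs_per_model) / num_models
        for k in range(num_classes)
    ]
    return max(range(num_classes), key=mean.__getitem__)


# Hypothetical 2-parameter optimization trajectory, averaged from t0 = 1.
trajectory = [[0.0, 4.0], [2.0, 2.0], [4.0, 0.0], [6.0, 2.0]]
avg: List[float] = []
for t, params in enumerate(trajectory):
    avg = sma_update(avg, params, t, t0=1)

# Hypothetical outputs of three averaged models on one 3-class example.
predicted_class = ensemble_predict([[2.0, 1.0, 0.0], [0.0, 1.5, 0.0], [0.0, 1.5, 0.1]])
```

In this toy run `avg` ends up as the mean of the last three iterates, and the ensemble picks the class that two of the three models favor. The running mean has no decay rate to tune, which is consistent with the TL;DR describing the protocol as hyperparameter-free.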

Paper Structure

This paper contains 27 sections, 5 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Model averaging improves out-domain performance stability. Left: In-domain validation accuracy and out-domain test accuracy during training of models using ERM. Right: Same as left, except validation and test predictions are made using a simple moving average of the model being optimized, along its optimization path. Details: The plots are for the TerraIncognita dataset with domain L38 used as the test domain, and others as training/validation data, and ResNet-50. Solid lines denote accuracy, dashed lines denote training loss, and dash-dot lines denote best accuracy achieved during training and all runs (for reference). Each color denotes a different run with a different random seed and training/validation split. Gist: Model averaging reduces out-domain performance instability, and makes the test curves correlate better with the validation curves, making model selection using in-domain validation set more reliable during optimization. We see a similar pattern when using ensemble of models, with and without model averaging, in Figure 2.
  • Figure 2: Ensemble of moving averages (EoA) (right) has better out-domain test performance stability compared with ensemble of online models (left), w.r.t. in-domain validation accuracy. Details: The plots are for the TerraIncognita dataset with domain L38 used as the test domain, and others as training/validation domain, and ResNet-50. Each ensemble has 6 different models from independent runs with different random seeds, hyper-parameters, and training/validation split.
  • Figure 3: Left: Effect of ensemble size (number of models in an ensemble) on out-domain performance (mean and standard error) for models with and without moving average (MA) parameters for ResNet-50 pre-trained on ImageNet. Right: Using the performance of an ensemble of size 1 (shown in the left plot) as reference, the right plot shows the percentage point improvement for ensembles of size $> 1$. The plots show that i) ensembles of averages (solid lines in left plot) are consistently better than ensembles of models without averaging (dashed lines in left plot); ii) ensembles of averages consistently improve performance over averaged models (ensemble of size 1 in right plot).
  • Figure 4: The scale of two terms: the moving average model's logit and the second-order term in Eq. \ref{eq_taylor_expansion}. The latter concentrates around 0, suggesting that our model averaging protocol approximates ensembles.
  • Figure 5: The impact of the iteration $t_0$ at which we start simple moving averaging, as described in Eq. \ref{eq_my_sma}, on the domain generalization performance for the PACS and TerraIncognita datasets. The dominant pattern across all the experiments suggests that starting averaging earlier yields a stronger boost in performance.
  • ...and 7 more figures
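
The Figure 1 caption argues that averaging makes out-domain test curves track in-domain validation curves more closely, which the abstract quantifies as rank correlation. The sketch below shows one way such a correlation could be computed over a run's checkpoints. Using Spearman's coefficient is an assumption (the text above says only "rank correlation"), the helper names are hypothetical, and the accuracy values are made up for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up accuracies at four checkpoints of one hypothetical run.
val_acc = [0.80, 0.82, 0.85, 0.87]
test_acc = [0.40, 0.43, 0.47, 0.50]
rho = spearman(val_acc, test_acc)
```

A coefficient near 1 means the checkpoint with the best validation accuracy is also likely to be among the best on the test domain, which is the property that makes early stopping on in-domain validation data reliable.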