Discovering environments with XRM

Mohammad Pezeshki, Diane Bouchacourt, Mark Ibrahim, Nicolas Ballas, Pascal Vincent, David Lopez-Paz

TL;DR

This work tackles the challenge of robust OOD generalization without human-provided environment annotations by introducing Cross-Risk Minimization (XRM), a twin-network framework that automatically discovers environments. XRM trains two classifiers on random halves of the data and forces them to imitate confident held-out mistakes, creating an echo chamber that emphasizes spurious correlations while preserving invariances. After training, a cross-mistake rule annotates all examples with environments, enabling downstream OOD methods (e.g., GroupDRO, CORAL) to achieve oracle-like worst-group accuracy across multiple benchmarks, often matching or approaching human-annotated environments. The approach is efficient, avoids early stopping, and demonstrates strong empirical gains across sub-population shifts, DomainBed, and domain generalization tasks, highlighting its practical impact for scalable, annotation-free OOD generalization.

Abstract

Environment annotations are essential for the success of many out-of-distribution (OOD) generalization methods. Unfortunately, these are costly to obtain and often limited by human annotators' biases. To achieve robust generalization, it is essential to develop algorithms for automatic environment discovery within datasets. Current proposals, which divide examples based on their training error, suffer from one fundamental problem. These methods introduce hyper-parameters and early-stopping criteria, which require a validation set with human-annotated environments, the very information subject to discovery. In this paper, we propose Cross-Risk-Minimization (XRM) to address this issue. XRM trains twin networks, each learning from one random half of the training data, while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early-stopping, and can discover environments for all training and validation data. Algorithms built on top of XRM environments achieve oracle worst-group-accuracy, addressing a long-standing challenge in OOD generalization. Code is available at https://github.com/facebookresearch/XRM.
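The "imitating confident held-out mistakes" step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `flip_labels` and the exact flip probability (held-out confidence rescaled so that a uniform prediction never triggers a flip and a fully confident one always does) are assumptions made for the example.

```python
import numpy as np

def flip_labels(labels, heldout_probs, rng):
    """Flip labels toward confident held-out mistakes (illustrative sketch).

    labels:        (n,) current integer labels.
    heldout_probs: (n, k) class probabilities from the twin that did NOT
                   train on each example (its held-out prediction).
    rng:           a numpy.random.Generator.
    """
    labels = np.asarray(labels)
    probs = np.asarray(heldout_probs, dtype=float)
    n_classes = probs.shape[1]

    pred = probs.argmax(axis=1)   # held-out predicted class
    conf = probs.max(axis=1)      # held-out confidence

    # Assumed rule: map confidence from [1/k, 1] to a flip probability in [0, 1].
    flip_prob = (conf - 1.0 / n_classes) * n_classes / (n_classes - 1)

    # Only held-out mistakes can be imitated; confident ones flip more often.
    flip = (pred != labels) & (rng.random(len(labels)) < flip_prob)
    return np.where(flip, pred, labels), flip


rng = np.random.default_rng(0)
new_labels, flipped = flip_labels(
    labels=[0, 0, 1],
    heldout_probs=[[0.0, 1.0], [0.5, 0.5], [0.0, 1.0]],
    rng=rng,
)
```

In this toy call only the first example is a held-out mistake, and it is made with full confidence, so it is the only label that flips. In a training loop, each twin would then fit the flipped labels of its own half, which is what the TL;DR describes as an echo chamber.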

Paper Structure (33 sections, 4 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: (a) Waterbirds problem with four groups: a majority group of waterbirds in water, landbirds in land, waterbirds in land, and a minority group of landbirds in water. Models often rely on spurious features to classify the majority of examples and then memorize the minority examples. (b) Worst-group-accuracy (minority) for different methods. (Dotted line) ERM achieves $61\%$. (Dashed line) GroupDRO with human group annotations (oracle) achieves $87\%$. (Dashdot blue line) Prior work to discover groups requires early-stopping with surgical precision. (Solid red line) XRM enables an oracle performance of $87\%$ without requiring early stopping.
  • Figure 2: XRM on the Waterbirds problem, concerning waterbirds in water, waterbirds in land, landbirds in water, landbirds in land. The top-left panel shows that "percentage of XRM label-flips at convergence" is a strong indicator of "worst-group-accuracy of OOD generalization algorithm", making flips a good criterion to select twin hyper-parameters. The two bottom panels show the signed margin of the twins on each ground-truth group. Each of the bottom plots corresponds to one of the classes. Note that a positive margin means correct classification. From each of these class-dependent plots, XRM discovers two environments: one for points in the "mistake-free" white area, and one for points in the "cross-mistake" gray areas. Notably, XRM is able to allocate the two smallest groups to dedicated environments. Another notable observation is that the two bottom plots appear as straight lines, indicating that the twin networks agree on their predictions. The top-right panel shows that label flipping happens almost exclusively for the two smallest groups, and stabilizes as training progresses.
  • Figure 3: Randomly selected images from CIFAR-10, identified by XRM. Although CIFAR-10 lacks predefined environment annotations, our method has successfully uncovered intriguing environments. Notably, well-classified examples (when held-out) are prototypical, featuring planes in blue skies and deer on green landscapes. In contrast, misclassified examples (when held-out) are less typical, which means they are correctly classified only when included in the training set.
  • Figure 4: Logit scatter plots for models trained on Dominoes-MF. The x-axis shows the logits for the original training examples, while the y-axis shows the logits for the same training examples with the MNIST part removed. These plots help reveal the extent of reliance on the spurious feature (MNIST digit) versus the core feature (FashionMNIST or CIFAR). a) Vanilla ERM model: strong reliance on the spurious MNIST digit feature and partial dependence on the core feature. b) XRM model: strong reliance on the spurious MNIST digit. c) GroupDRO model with XRM-inferred annotations: points lie mostly on the diagonal, suggesting invariance to the spurious feature and a focus on the core feature only. d) GroupDRO with ground-truth annotations, for reference. We highlight that for XRM, it is desirable to rely only on the spurious feature, since this then enables the subsequent GroupDRO to learn the invariant core feature.
  • Figure 5: Hyperparameter sensitivity analysis of XRM on Waterbirds. Here we evaluate the sensitivity of XRM's performance to variations in the learning rate and weight decay hyperparameters, while keeping the batch size fixed at 512. Left: worst-group-accuracy of a GroupDRO model trained with XRM annotations. Each cell within the grid represents a hyperparameter combination for XRM, with the color intensity indicating the test worst-group-accuracy. Right: the percentage of labels flipped by XRM for the corresponding hyperparameter combination, serving as the model selection criterion. Analysis: The best XRM model is chosen according to the highest flip percentage (red square), which resulted in an accuracy of 87.3% (dashed red square). However, the plot on the left shows that even neighboring cells with very different hyperparameters can still lead to near-optimal performance.
  • ...and 1 more figures
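
The captions of Figures 2 and 5 describe two post-training steps: assigning each example to a "mistake-free" or "cross-mistake" environment, and selecting twin hyper-parameters by the percentage of flipped labels. The sketch below illustrates both in NumPy. It is not the authors' code: the function names are hypothetical, and it assumes an example lands in the cross-mistake environment when at least one twin misclassifies it with respect to its original label.

```python
import numpy as np

def cross_mistake_environments(labels, pred_a, pred_b):
    """Return 0 for 'mistake-free' examples and 1 for 'cross-mistake' ones.

    Assumption: an example is a cross-mistake if either twin's predicted
    class differs from the original (unflipped) label.
    """
    labels = np.asarray(labels)
    mistake_a = np.asarray(pred_a) != labels
    mistake_b = np.asarray(pred_b) != labels
    return (mistake_a | mistake_b).astype(int)

def flip_percentage(original_labels, final_labels):
    """Percentage of labels that differ from the originals at convergence."""
    changed = np.asarray(original_labels) != np.asarray(final_labels)
    return 100.0 * float(np.mean(changed))

def select_twin_config(runs):
    """Pick the configuration with the highest flip percentage.

    runs: list of (config, flip_pct) pairs, one per hyper-parameter setting.
    """
    return max(runs, key=lambda run: run[1])[0]


# Toy usage with made-up predictions and hypothetical configurations.
envs = cross_mistake_environments(
    labels=[0, 0, 1, 1], pred_a=[0, 1, 1, 1], pred_b=[0, 0, 1, 0]
)
best = select_twin_config([({"lr": 1e-3}, 2.0), ({"lr": 1e-4}, 5.5)])
```

The resulting environment labels, combined with the class labels, could then be passed as group annotations to a downstream method such as GroupDRO. Note that the selection step needs no human-annotated validation environments, which is the property the abstract emphasizes.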