Generalizing to any diverse distribution: uniformity, gentle finetuning and rebalancing

Andreas Loukas, Karolis Martinkus, Ed Wagstaff, Kyunghyun Cho

TL;DR

Training on a uniform distribution over the domain is optimal for worst-case generalization to any sufficiently diverse test distribution, and the theory provides a mathematical grounding for previous observations on the role of entropy and rebalancing in o.o.d. generalization and foundation model training.

Abstract

As training datasets grow larger, we aspire to develop models that generalize well to any diverse test distribution, even if the latter deviates significantly from the training data. Various approaches like domain adaptation, domain generalization, and robust optimization attempt to address the out-of-distribution challenge by making assumptions about the relation between the training and test distributions. In contrast, we adopt a more conservative perspective by accounting for the worst-case error across all sufficiently diverse test distributions within a known domain. Our first finding is that training on a uniform distribution over this domain is optimal. We also investigate practical remedies when uniform samples are unavailable by considering methods for mitigating non-uniformity through finetuning and rebalancing. Our theory provides a mathematical grounding for previous observations on the role of entropy and rebalancing for o.o.d. generalization and foundation model training. We also provide new empirical evidence across tasks involving o.o.d. shifts, which illustrates the broad applicability of our perspective.
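
To make "worst-case error across all sufficiently diverse test distributions" concrete, the sketch below works on a finite domain, which is an assumption made here for illustration. Given a per-point loss, it searches for the test distribution with the highest expected loss among those whose entropy is at least $\log n$ minus a gap. Maximizing a linear objective under an entropy lower bound yields a Gibbs distribution, so the sketch bisects on a temperature. The names `worst_case_risk` and `entropy_gap` are hypothetical, the gap is measured in nats, and the paper's own parametrization of $\gamma$ and its greedy approximation may differ.

```python
import math


def gibbs(losses, temperature):
    """Distribution q_i proportional to exp(loss_i / temperature)."""
    top = max(losses)  # subtract the maximum for numerical stability
    unnorm = [math.exp((loss - top) / temperature) for loss in losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def entropy(q):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in q if p > 0.0)


def worst_case_risk(losses, entropy_gap, iters=100):
    """Approximate max_q sum_i q_i * loss_i subject to H(q) >= log(n) - entropy_gap.

    `entropy_gap` is a hypothetical parameter for this sketch (in nats).
    Returns the risk and the adversarial distribution q.
    """
    n = len(losses)
    if entropy_gap <= 0.0 or max(losses) == min(losses):
        # No slack (or constant loss): the uniform distribution is the answer.
        return sum(losses) / n, [1.0 / n] * n
    target = math.log(n) - entropy_gap
    lo, hi = 1e-6, 1e6  # temperature bracket; entropy increases with temperature
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if entropy(gibbs(losses, mid)) < target:
            lo = mid  # too concentrated: raise the temperature
        else:
            hi = mid  # feasible: try to concentrate further
    q = gibbs(losses, hi)
    return sum(p * loss for p, loss in zip(q, losses)), q


# Zero-one losses of a classifier that errs on 2 of 10 points.
losses = [1.0, 1.0] + [0.0] * 8
for gap in (0.0, 0.5, 1.0):
    risk, _ = worst_case_risk(losses, gap)
    print(f"entropy gap {gap:.1f}: worst-case risk {risk:.3f}")
```

With a zero gap the result is the uniform expected risk, and as the gap grows the adversary can concentrate more mass on misclassified points, so the risk moves toward the worst-case loss.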

Paper Structure (28 sections, 12 theorems, 79 equations, 8 figures, 7 tables)

Key Result

Theorem 3.1

Consider a zero-one loss and suppose that we can train a classifier up to some fixed expected risk $\varepsilon < 1/2$ under any distribution. A classifier optimized for the uniform distribution will yield the smallest DD risk:

Figures (8)

  • Figure 1: Influence of training set size and entropy gap on DD risk $r_{\text{dd}}(f; \gamma)$ on the mixture of Gaussians task. Here the DD risk is greedily approximated by constructing adversarial test distributions that satisfy the desired entropy bound. The amount of training data required to achieve a low DD risk increases sharply with the entropy gap $\gamma$ between the uniform and the test distribution, with the DD risk interpolating between the uniform expected risk and the worst-case risk. The adversarial test distribution risk is always below our $r_{\text{dd}}$ bound from Theorem \ref{theorem:expected_agnostic_gap}.
  • Figure 2: Effect of rebalancing on model error. Left: In red, we depict the area over which the model predicts the wrong label when trained without rebalancing. The black line denotes the ground-truth decision boundary. Middle: The plot shows the training set (sampled from a Gaussian distribution) and the importance weights used for rebalancing. These focus the model's attention on more sparsely sampled regions. Right: When trained with rebalancing, the model approximates the ground-truth decision boundary more closely.
  • Figure 3: The achieved DD risk is smaller for models trained on more uniform training data. The training data is drawn from a truncated Gaussian distribution with increasing standard deviation $\sigma$, such that the sampling becomes gradually more uniform over the sample space. As theorized, the DD risk decays for larger $\sigma$, following the trend of the uniform expected risk. Rebalancing reduces both the uniform expected risk and the DD risk (here for $\gamma = 0.99$). We use a masked auto-regressive flow $\hat{p}$ to fit the density $p$ of the training data and set $w(x_i) \propto \min (1 / \hat{p}(x_i)^\tau, \beta)$, with $\tau = 1$ controlling the smoothness of the weights and $\beta$, set based on a quantile of the training likelihood, capping the effect of outliers. Naturally, increasing the dataset size reduces DD risk. However, rebalancing remains equally beneficial across all training set sizes tested, showing that an increase in data size does not remove the need for uniformity.
  • Figure 4: Log-likelihoods used for training set rebalancing as well as i.d. and o.o.d. set log-likelihood distributions. iWildCam and ColorMNIST feature covariate shift, as the density support is largely the same across all sets. The o.o.d. PovertyMap set contains a notable domain shift.
  • Figure 5: Different sampling strategies for our synthetic dataset. Points are sampled either uniformly or from a truncated Gaussian with varying standard deviation. To label the samples, we use four randomly placed univariate Gaussian centroids and assign to each point $x$ the label $y$ (red or green) of the Gaussian with the highest likelihood. The resulting decision boundary is shown in black.
  • ...and 3 more figures
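
The rebalancing weights described in the Figure 3 caption, $w(x_i) \propto \min(1 / \hat{p}(x_i)^\tau, \beta)$, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain Gaussian kernel density estimate stands in for the masked auto-regressive flow, the cap $\beta$ is taken as a quantile of the uncapped weights (one possible reading of "a quantile of the training likelihood"), and the names `gaussian_kde_density`, `rebalancing_weights`, `bandwidth`, and `cap_quantile` are hypothetical.

```python
import math
import random


def gaussian_kde_density(points, bandwidth):
    """Return a function estimating the density of 2-D `points`.

    Assumption: a Gaussian KDE replaces the flow-based density model.
    """
    norm = 1.0 / (len(points) * 2.0 * math.pi * bandwidth ** 2)

    def density(x):
        total = 0.0
        for p in points:
            sq_dist = (x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density


def rebalancing_weights(points, density, tau=1.0, cap_quantile=0.95):
    """Weights proportional to min(1 / p_hat(x_i)^tau, beta), normalized to mean 1."""
    # Guard against underflow to zero before inverting the density.
    raw = [1.0 / max(density(x) ** tau, 1e-300) for x in points]
    # beta: hypothetical choice, the `cap_quantile` quantile of the raw weights.
    beta = sorted(raw)[min(len(raw) - 1, int(cap_quantile * len(raw)))]
    capped = [min(w, beta) for w in raw]
    mean = sum(capped) / len(capped)
    return [w / mean for w in capped]


# Training set sampled from a Gaussian, as in the Figure 2 setup.
rng = random.Random(0)
train = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(300)]
p_hat = gaussian_kde_density(train, bandwidth=0.5)
weights = rebalancing_weights(train, p_hat)
```

The resulting `weights` could be passed as per-sample weights to a weighted loss, so that sparsely sampled regions count more during training while the cap limits the influence of outliers.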

Theorems & Definitions (19)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 4.1
  • Theorem 4.2
  • Theorem C.1
  • proof
  • Theorem D.1
  • proof
  • Lemma D.1
  • Lemma D.1
  • ...and 9 more