Table of Contents
Fetching ...

No One Representation to Rule Them All: Overlapping Features of Training Methods

Raphael Gontijo-Lopes, Yann Dauphin, Ekin D. Cubuk

TL;DR

The paper challenges the assumption that high-accuracy models trained with supervision share biases by performing a large-scale study of 82 models across training methods, architectures, and datasets. It shows that diverging training methods yield uncorrelated errors, enabling more effective ensembles, and that models can specialize in different data subdomains and learn overlapping but non-superset feature sets. These findings explain why combining diverse, differently trained models improves both in-domain and downstream performance, and reveal that even low-accuracy models can contribute usefully when sufficiently diverse. The work advocates for embracing training-methodology diversity to expand feature coverage and improve transfer learning and ensemble gains.

Abstract

Despite being able to capture a range of features of the data, high accuracy models trained with supervision tend to make similar predictions. This seemingly implies that high-performing models share similar biases regardless of training methodology, which would limit ensembling benefits and render low-accuracy models as having little practical use. Against this backdrop, recent work has developed quite different training techniques, such as large-scale contrastive learning, yielding competitively high accuracy on generalization and robustness benchmarks. This motivates us to revisit the assumption that models necessarily learn similar functions. We conduct a large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets. We find that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors. We show these models specialize in subdomains of the data, leading to higher ensemble performance: with just 2 models (each with ImageNet accuracy ~76.5%), we can create ensembles with 83.4% (+7% boost). Surprisingly, we find that even significantly low-accuracy models can be used to improve high-accuracy models. Finally, we show diverging training methodology yield representations that capture overlapping (but not supersetting) feature sets which, when combined, lead to increased downstream performance.

No One Representation to Rule Them All: Overlapping Features of Training Methods

TL;DR

The paper challenges the assumption that high-accuracy models trained with supervision share biases by performing a large-scale study of 82 models across training methods, architectures, and datasets. It shows that diverging training methods yield uncorrelated errors, enabling more effective ensembles, and that models can specialize in different data subdomains and learn overlapping but non-superset feature sets. These findings explain why combining diverse, differently trained models improves both in-domain and downstream performance, and reveal that even low-accuracy models can contribute usefully when sufficiently diverse. The work advocates for embracing training-methodology diversity to expand feature coverage and improve transfer learning and ensemble gains.

Abstract

Despite being able to capture a range of features of the data, high accuracy models trained with supervision tend to make similar predictions. This seemingly implies that high-performing models share similar biases regardless of training methodology, which would limit ensembling benefits and render low-accuracy models as having little practical use. Against this backdrop, recent work has developed quite different training techniques, such as large-scale contrastive learning, yielding competitively high accuracy on generalization and robustness benchmarks. This motivates us to revisit the assumption that models necessarily learn similar functions. We conduct a large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets. We find that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors. We show these models specialize in subdomains of the data, leading to higher ensemble performance: with just 2 models (each with ImageNet accuracy ~76.5%), we can create ensembles with 83.4% (+7% boost). Surprisingly, we find that even significantly low-accuracy models can be used to improve high-accuracy models. Finally, we show diverging training methodology yield representations that capture overlapping (but not supersetting) feature sets which, when combined, lead to increased downstream performance.

Paper Structure

This paper contains 22 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: As training methodology diverges (Reinit $\rightarrow$ Dataset), errors become uncorrelated. Left: 'Observed error inconsistency' is the fraction of examples where only one model in the pair makes a correct prediction. Higher error inconsistency indicates the model errors are uncorrelated. Right: As models are trained with increasingly different methodologies, their error inconsistency grows, providing an opportunity for converting these examples into correct ensemble predictions.
  • Figure 2: As uncorrelated errors increase, so does ensemble performance (left, center), and so does ensemble efficiency (right). Left: Error inconsistency is linearly correlated with the ensemble performance improvement, relative to the mean accuracy of the models in the ensemble. Stars represent averages over models in each category (w/ error bars). Center: Because we limited our analysis to models in the same 74-78% accuracy range, the increase in relative accuracy translates into absolute accuracy boost, with best performing ensembles comprising of models whose training methodologies are most different. Right: Surprisingly, the conversion rate -- the rate at which these examples are converted into correct predictions by the ensemble -- also increases. This indicates that the benefits of combining divergently-trained models go beyond increasing the number of examples that can become correct predictions, to also increasing how efficiently these examples do become correct predictions.
  • Figure 3: Differently-trained models specialize: We plot histograms of examples where at least one ensemble produced error inconsistency, as a function of specialization measure $\theta$, the angle distance in the confidence-confidence plot (upper left). As we saw in Fig. \ref{['fig:comparing_models_in_ensemble']}, when model training setups diverge (Reinit $\rightarrow$ Arch $\rightarrow$ Dataset), the fraction of consistent errors decreases (upper center & right), in favor of more error inconsistency (lower center & right). This added error inconsistency comes with specialization of the models in an ensemble: when only Model 1 makes a correct prediction, it is often more confident (lower right), and vice-versa for Model 2 (lower center). Faint dotted lines indicate values of $\theta$ for which a model's top-1 prediction is likely to prevail at ensemble time.
  • Figure 4: Specialization type depends on training setup: When models differ in dataset (right plot), not only is specialization highest (see Fig. \ref{['fig:dissimilar_models_specialize']}), but also this specialization happens in different classes -- CLIP is better at anthropogenic classes (cids 500-900) than ResNet-50, which is better at nature classes (cids 0-300; Right detail). When models differ in their architecture (Center plot), they are more specialized than reinitializations (Left plot), but such specialization does not correlate with specific classes.
  • Figure 5: Lower-accuracy models can benefit high-accuracy ensembles: Left: Starting with 4 high-accuracy models (colored markers; CLIP-L, ALIGN, BiT-1k, EfficientNet-B3), we greedily select the best lower accuracy models (each with max individual accuracy 77.9%, indicated in the legend) to ensemble, and plot the ensemble's accuracy. Colors indicate which high-accuracy model the ensemble begins with, shapes indicate which models are added. By adding only lower-accuracy models to a high-accuracy one, we are able to create ensembles that reach as high as 86.7%. This shows that lower-accuracy models can be made useful, if they are diverse enough. Right: In each case, the best first model to add is one that complements the base model's training methodology.
  • ...and 8 more figures