No One Representation to Rule Them All: Overlapping Features of Training Methods

Raphael Gontijo-Lopes; Yann Dauphin; Ekin D. Cubuk

TL;DR

The paper challenges the assumption that high-accuracy models trained with supervision share biases by performing a large-scale study of 82 models across training methods, architectures, and datasets. It shows that diverging training methods yield uncorrelated errors, enabling more effective ensembles, and that models can specialize in different data subdomains and learn overlapping but non-superset feature sets. These findings explain why combining diverse, differently trained models improves both in-domain and downstream performance, and reveal that even low-accuracy models can contribute usefully when sufficiently diverse. The work advocates for embracing training-methodology diversity to expand feature coverage and improve transfer learning and ensemble gains.

Abstract

Despite being able to capture a range of features of the data, high-accuracy models trained with supervision tend to make similar predictions. This seemingly implies that high-performing models share similar biases regardless of training methodology, which would limit ensembling benefits and render low-accuracy models as having little practical use. Against this backdrop, recent work has developed quite different training techniques, such as large-scale contrastive learning, yielding competitively high accuracy on generalization and robustness benchmarks. This motivates us to revisit the assumption that models necessarily learn similar functions. We conduct a large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets. We find that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors. We show these models specialize in subdomains of the data, leading to higher ensemble performance: with just 2 models (each with ImageNet accuracy ~76.5%), we can create ensembles with 83.4% (+7% boost). Surprisingly, we find that even significantly low-accuracy models can be used to improve high-accuracy models. Finally, we show that diverging training methodologies yield representations that capture overlapping (but not supersetting) feature sets which, when combined, lead to increased downstream performance.
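The link between uncorrelated errors and ensemble gains can be illustrated with a small sketch. The abstract does not specify how error correlation is measured or how the ensembles are built, so everything here is an assumption: the function names (`error_inconsistency`, `ensemble_predict`) are hypothetical, the metric is taken to be the fraction of examples that exactly one of two models classifies correctly, the ensemble is taken to be a plain average of class probabilities, and the toy numbers are invented for illustration rather than drawn from the paper.

```python
from typing import List, Sequence


def argmax_rows(probs: Sequence[Sequence[float]]) -> List[int]:
    """Return the index of the highest-probability class for each example."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of examples predicted correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def error_inconsistency(
    preds_a: Sequence[int], preds_b: Sequence[int], labels: Sequence[int]
) -> float:
    """Assumed metric: fraction of examples where exactly one model is correct."""
    only_one = sum(
        (a == y) != (b == y) for a, b, y in zip(preds_a, preds_b, labels)
    )
    return only_one / len(labels)


def ensemble_predict(
    probs_a: Sequence[Sequence[float]], probs_b: Sequence[Sequence[float]]
) -> List[int]:
    """Assumed ensembling rule: average class probabilities, then take argmax."""
    averaged = [
        [(x + y) / 2 for x, y in zip(pa, pb)] for pa, pb in zip(probs_a, probs_b)
    ]
    return argmax_rows(averaged)


# Toy data (made up): 4 examples, 2 classes.
labels = [0, 1, 0, 1]
probs_a = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.1, 0.9]]  # model A errs on the third example
probs_b = [[0.8, 0.2], [0.6, 0.4], [0.9, 0.1], [0.3, 0.7]]  # model B errs on the second example

preds_a = argmax_rows(probs_a)
preds_b = argmax_rows(probs_b)
preds_ens = ensemble_predict(probs_a, probs_b)

print("accuracy A:", accuracy(preds_a, labels))
print("accuracy B:", accuracy(preds_b, labels))
print("error inconsistency:", error_inconsistency(preds_a, preds_b, labels))
print("ensemble accuracy:", accuracy(preds_ens, labels))
```

In this toy setup the two models have equal accuracy but fail on different examples, and each is confident where the other is wrong, so the averaged prediction recovers both mistakes. If the two models erred on the same examples, the error inconsistency would be zero and averaging could not help.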
