Pathologies of Predictive Diversity in Deep Ensembles

Taiga Abe; E. Kelly Buchanan; Geoff Pleiss; John P. Cunningham

TL;DR

This work addresses whether predictive diversity improves ensembles of high-capacity neural networks. Through a large-scale empirical study of roughly 600 deep ensembles and multiple diversity-control mechanisms, it shows that diversity-promoting strategies often hurt large ensembles, while diversity-discouraging approaches can be benign or beneficial. A key finding is that the benefits of diversity diminish as component model capacity increases, and the best deep ensembles are typically formed from higher-capacity, less diverse components. The authors conclude that traditional diversity intuitions from low-capacity ensembles do not transfer to modern deep ensembles, and suggest focusing on more powerful component models rather than forcing diversity, with implications for training practices and resource allocation.

Abstract

Classic results establish that encouraging predictive diversity improves performance in ensembles of low-capacity models, e.g. through bagging or boosting. Here we demonstrate that these intuitions do not apply to high-capacity neural network ensembles (deep ensembles), and in fact the opposite is often true. In a large-scale study of nearly 600 neural network classification ensembles, we examine a variety of interventions that trade off component model performance for predictive diversity. While such interventions can improve the performance of small neural network ensembles (in line with standard intuitions), they harm the performance of the large neural network ensembles most often used in practice. Surprisingly, we also find that discouraging predictive diversity is often benign in large-network ensembles, fully inverting standard intuitions. Even when diversity-promoting interventions do not sacrifice component model performance (e.g. using heterogeneous architectures and training paradigms), we observe an opportunity cost associated with pursuing increased predictive diversity. Examining over 1000 ensembles, we observe that the performance benefits of diverse architectures/training procedures are easily dwarfed by the benefits of simply using higher-capacity models, despite the fact that such higher-capacity models often yield significantly less predictive diversity. Overall, our findings demonstrate that standard intuitions around predictive diversity, originally developed for low-capacity ensembles, do not directly apply to modern high-capacity deep ensembles. This work clarifies fundamental challenges to the goal of improving deep ensembles by making them more diverse, while suggesting an alternative path: simply forming ensembles from ever more powerful (and less diverse) component models.

Paper Structure (41 sections, 9 equations, 27 figures, 6 tables)

Introduction
Related work
The trade-off between predictive diversity and component model performance
Manipulating predictive diversity in neural network ensembles
Results
Analysis
The best deep ensembles express low predictive diversity
Experimental setup.
Results
Analysis
Discussion
Computational resources
Model training
Code availability.
Diversity regularization experiments.
...and 26 more sections

Figures (27)

Figure 1: The diversity/component model performance trade-off. Ensemble performance, measured by negative log likelihood (NLL), is decomposed into average single-model NLL (vertical axis) and Jensen-gap predictive diversity (horizontal axis); see \ref{eqn:tradeoffs}. Diagonal lines correspond to level sets of ensemble NLL (lower right is better). The performance of any ensemble can be plotted as a point on this graph ($\times$) with a corresponding level set of ensemble performance (thick diagonal line). Along a level set, all ensembles have the same NLL. There are two strategies for improving the performance of any given ensemble: increasing the predictive diversity of the component models (right arrow) or improving the average NLL of the component models (down arrow). An intervention improves ensemble performance only if the resulting ensemble lies below the thick diagonal.
Figure 2: Encouraging/discouraging diversity has different effects on small vs. large neural network ensembles. We train deep ensembles of small (top row) and large (bottom row) neural networks (ResNet 8 and ResNet 18, respectively) with diversity mechanisms on CIFAR10 (one column per diversity mechanism). We compare standard ensemble training (dotted line) to diversity-encouraged/discouraged training (left vs. right of dotted line). Blue lines/bands are standard ensemble test accuracy; black lines/bands are component model test accuracy. Encouraging diversity ($\gamma < 0$) improves test accuracy of small-network ensembles while harming test accuracy of large-network ensembles. Conversely, discouraging diversity ($\gamma > 0$) actively hurts the performance of small-network ensembles, but appears benign for large-network ensembles. (See \ref{appx:replications} for CIFAR100, TinyImageNet, and other architectures.)
Figure 3: Trading off predictive diversity and component model performance in diversity-regularized ensembles. Each marker represents a ResNet 8 (left panels) or ResNet 18 (right panels) ensemble trained with a diversity intervention on CIFAR10 (see \ref{fig:diversity_cifar100_decomp} for CIFAR100). Warmer colors correspond to positive $\gamma$ values (discouraging diversity); cooler colors correspond to negative $\gamma$ values (encouraging diversity). Axes are given by the ensemble loss decomposition as in \ref{fig:schematic}. The level set of standard deep ensemble performance ($\gamma=0$) is denoted by $\times$ and the dotted diagonal line. In all ensembles, encouraging/discouraging predictive diversity leads to proportionally higher/lower diversity on test data. Small-network ensembles with diversity encouragement ($\gamma<0$, blue markers) can achieve higher single model performance, and thus better ensembles (below dotted line). For large-network ensembles, however, diversity encouragement comes at a high cost to average single model performance and yields worse ensembles overall (above dotted line). For small-network ensembles, discouraging predictive diversity ($\gamma>0$, red markers) also leads to worse performance of component models, so the corresponding ensemble performance is also worse than standard ensembling (above dotted line). In contrast, diversity-discouraged large-network ensembles can outperform standard training (below dotted line). Among large-network ensembles, the one with the best test NLL is one where diversity was discouraged. In \ref{fig:biasvar_decomp_acc}, we also study the relationship between diversity and average accuracy of these ensembles.
Figure 4: Diversity encouragement hurts confident/accurate classifier ensembles. (Left, center:) We study the per-datapoint impact of diversity encouragement via counterfactual accuracy: the per-datapoint accuracy given by a standard ($\gamma=0$) ensemble's predictions (see \ref{sec:counterfactual} for details). Evaluating an ensemble trained with diversity encouragement (\ref{eqn:jensen_gap}) produces a distribution of ensemble diversity over test set predictions (top row, one line per $\gamma$ value). Across this distribution, we then measure the counterfactual accuracy (bottom row), and focus on data in the right tail, which are the most strongly influenced by diversity encouragement techniques. Unfortunately, we see that predictions which are most influenced by diversity encouragement have high counterfactual accuracy: i.e., if we were to test a standard ensemble on these datapoints, it would have been correct anyway. This finding holds for small-network (ResNet 8, left) and large-network (ResNet 18, center) ensembles (for other diversity-encouraging mechanisms, see \ref{appx:counterfactual}). (Right:) Decorrelating correct/confident component model predictions yields worse ensemble performance. Correct/confident component model predictions ($\times$) concentrate in a vertex of the probability simplex (top panel). Due to simplex geometry, encouraging diversity (left panel) necessarily degrades the ensemble prediction ($\circ$), potentially altering the class prediction. In contrast, diversity discouragement (right panel) need not hurt ensemble predictions.
Figure 5: The opportunity cost of predictive diversity. Plots depict ensemble test-set performance versus the performance of the average component model (both measured by cross entropy). Each marker is an ensemble of models evaluated on the InD/OOD datasets of CIFAR10/CINIC10 and ImageNet/ImageNet-C. Dotted boundaries indicate the inset region, provided for detail. Ensembles are color coded as homogeneous (blue) or heterogeneous (orange). Controlling for component model performance (vertical slices), heterogeneous ensembles are more diverse than homogeneous ones (further from the identity line). However, heterogeneous ensembles afford diminishing amounts of predictive diversity (closer to the identity line) when ensembling higher-performing component models (further left). The best performing ensembles (furthest down) have the best component models (furthest left), and very little predictive diversity.
...and 22 more figures
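
The decomposition used as axes in Figures 1 and 3 (ensemble NLL = average single-model NLL minus Jensen-gap predictive diversity) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `nll_decomposition`, the `(models, datapoints, classes)` array layout, and the choice to form the ensemble by averaging component class probabilities are assumptions made here for illustration, and the inputs are random synthetic predictions rather than trained networks.

```python
import numpy as np


def nll_decomposition(probs, labels):
    """Split ensemble NLL into average single-model NLL and a Jensen gap.

    probs:  array of shape (M, N, C), class probabilities from M models
            (hypothetical layout chosen for this sketch).
    labels: integer array of shape (N,), true class per datapoint.
    """
    probs = np.asarray(probs, dtype=float)
    n = probs.shape[1]
    # Probability each model assigns to the true class: shape (M, N).
    p_true = probs[:, np.arange(n), labels]
    # Mean over models and datapoints of each model's own NLL.
    avg_single_nll = -np.log(p_true).mean()
    # NLL of the ensemble, assumed here to average component probabilities.
    ensemble_nll = -np.log(p_true.mean(axis=0)).mean()
    # Non-negative by Jensen's inequality, since -log is convex.
    jensen_gap = avg_single_nll - ensemble_nll
    return avg_single_nll, jensen_gap, ensemble_nll


# Synthetic example: 4 "models", 100 datapoints, 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 100, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
labels = rng.integers(0, 10, size=100)

avg_nll, diversity, ens_nll = nll_decomposition(probs, labels)

# Four copies of the same model: no predictive diversity at all.
identical = np.repeat(probs[:1], 4, axis=0)
_, zero_diversity, _ = nll_decomposition(identical, labels)
```

In this sketch, `avg_nll` and `diversity` correspond to the vertical and horizontal coordinates of a point in Figure 1, and `ens_nll` identifies the diagonal level set it lies on. Identical component models give a Jensen gap of zero, so such an ensemble performs exactly as well as its average component.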
