
Diversity-Aware Agnostic Ensemble of Sharpness Minimizers

Anh Bui, Vy Vo, Tung Pham, Dinh Phung, Trung Le

TL;DR

DASH is a learning algorithm that promotes diversity and flatness within deep ensembles by encouraging base learners to move divergently towards low-loss regions of minimal sharpness.

Abstract

There is ample theoretical and empirical evidence supporting the success of ensemble learning. Deep ensembles in particular take advantage of training randomness and the expressivity of individual neural networks to gain prediction diversity, ultimately leading to better generalization, robustness and uncertainty estimation. With respect to generalization, it has been found that pursuing wider local minima results in models that are more robust to shifts between training and testing sets. A natural research question arising from these two approaches is whether generalization can be improved further by integrating ensemble learning with loss sharpness minimization. Our work investigates this connection and proposes DASH, a learning algorithm that promotes diversity and flatness within deep ensembles. More concretely, DASH encourages base learners to move divergently towards low-loss regions of minimal sharpness. We provide a theoretical backbone for our method along with extensive empirical evidence demonstrating an improvement in ensemble generalizability.
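To make "low-loss regions of minimal sharpness" concrete, the following is a minimal sketch of a generic sharpness-aware (SAM-style) update applied independently to each base learner. It is not the DASH algorithm itself: the toy quadratic loss, the function names (`sam_step`, `ensemble_step`) and the values of `rho` and `lr` are all assumptions chosen for illustration, and the diversity term is deliberately left out.

```python
import math

def loss(theta):
    """Toy quadratic loss (an assumption for illustration only)."""
    return sum(t * t for t in theta)

def grad(theta):
    """Analytic gradient of the toy loss."""
    return [2.0 * t for t in theta]

def sam_step(theta, rho=0.05, lr=0.1):
    """One generic sharpness-aware step.

    1. Move to a perturbed point at distance rho along the gradient direction.
    2. Update the original parameters with the gradient taken at that point.
    """
    g = grad(theta)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # avoid division by zero
    theta_perturbed = [t + rho * x / norm for t, x in zip(theta, g)]
    g_perturbed = grad(theta_perturbed)
    return [t - lr * x for t, x in zip(theta, g_perturbed)]

def ensemble_step(thetas, rho=0.05, lr=0.1):
    """Apply the sharpness-aware step to every base learner independently."""
    return [sam_step(theta, rho, lr) for theta in thetas]

# Two base learners that happen to be initialized close to each other.
thetas = [[1.0, -2.0], [1.1, -1.9]]
for _ in range(50):
    thetas = ensemble_step(thetas)
```

On this toy loss both learners end up in the same region, because nothing in the update pushes them apart; this is the gap that a diversity-aware term is meant to address.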

Paper Structure (18 sections, 1 theorem, 9 equations, 4 figures, 5 tables)

Key Result

Theorem 1

Assume that the loss function $\ell$ is convex and upper-bounded by $L$. With probability at least $1-\delta$ over the choice of $\mathcal{S}\sim\mathcal{D}^{N}$, for any $0\leq\gamma\leq1$, we have where $\mathcal{H}$ is a strictly increasing function of $m$, $\rho$ and the set of model parameters $\{\theta_i\}_{i=1}^m$.

Figures (4)

  • Figure 1: Tuning of the hyper-parameter $\gamma$. Both the ensemble accuracy (ACC, higher is better) and the expected calibration error (ECE, lower is better) are at their best when $\gamma = 0.1$. See Table \ref{tab:tune-gamma} for other metrics.
  • Figure 2: Illustration of the model dynamics under the sharpness-aware term on the loss landscape. Two base learners $\theta_i$ and $\theta_j$ (represented by the red and black vectors respectively) happen to be initialized closely. At each step, the two perturbed models $\theta_i^a$ and $\theta_j^a$ are computed independently from $\theta_i$ and $\theta_j$ yet with the same mini-batch, so they are less diverse; hence the two updated models $\theta_i$ and $\theta_j$ are also less diverse and more likely to end up in the same low-loss, flat region.
  • Figure 3: Illustration of the model dynamics under the diversity-aware term. Given two base learners $\theta_i$ and $\theta_j$ (represented by the red and black vectors respectively), the gradients $-\nabla_{\theta_i} \widetilde{\mathcal{L}_B}(\theta_i)$ and $-\nabla_{\theta_j} \widetilde{\mathcal{L}_B}(\theta_j)$ navigate the models towards their low-loss (also flat) regions. Moreover, the two gradients $\nabla_{\theta_i} \mathcal{L}_B^{div}(\theta_i, \theta_{\neq i})$ and $\nabla_{\theta_j} \mathcal{L}_B^{div}(\theta_j, \theta_{\neq j})$ encourage the models to move divergently. As discussed, our update strategy forces the two gradients $-\nabla_{\theta_i} \widetilde{\mathcal{L}_B}(\theta_i)$ and $\nabla_{\theta_i} \mathcal{L}_B^{div}(\theta_i, \theta_{\neq i})$ to be more congruent. As a result, the two models are divergently oriented towards two non-overlapping low-loss and flat regions. The same behavior is imposed on the corresponding pair of gradients for the model $\theta_j$, altogether enhancing the ensemble diversity.
  • Figure 4: Evaluation of adversarial robustness. The x-axis denotes the perturbation size $\epsilon$ ($\times 255$).
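
The expected calibration error (ECE) reported in Figure 1 is commonly computed by binning predictions by confidence and averaging the gap between accuracy and confidence in each bin. The sketch below shows that standard binned estimator; the function name, the choice of 10 equal-width bins and the half-open bin edges are assumptions, and the paper's exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction matched the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece

# A well-calibrated toy case: 90% confidence and 9 out of 10 predictions correct.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])

# A maximally over-confident toy case: full confidence, every prediction wrong.
over_confident = expected_calibration_error([1.0] * 4, [False] * 4)
```

For an ensemble, `confidences` would typically come from the averaged predictive distribution of the base learners, so a lower ECE means the averaged probabilities track the observed accuracy more closely.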

Theorems & Definitions (1)

  • Theorem 1