Enhancing Diversity in Bayesian Deep Learning via Hyperspherical Energy Minimization of CKA

David Smerkous, Qinxun Bai, Fuxin Li

TL;DR

This paper explores the use of CKA in Bayesian deep learning to generate diverse ensembles and hypernetworks that output a network posterior. Because directly optimizing the CKA objective leads to diminishing gradients when two networks are very similar, it proposes applying hyperspherical energy on top of CKA kernels to address this drawback and improve training stability.

Abstract

Particle-based Bayesian deep learning often requires a similarity metric to compare two networks. However, naive similarity metrics lack permutation invariance and are inappropriate for comparing networks. Centered Kernel Alignment (CKA) on feature kernels has been proposed to compare deep networks but has not been used as an optimization objective in Bayesian deep learning. In this paper, we explore the use of CKA in Bayesian deep learning to generate diverse ensembles and hypernetworks that output a network posterior. Noting that CKA projects kernels onto a unit hypersphere and that directly optimizing the CKA objective leads to diminishing gradients when two networks are very similar, we propose adopting the approach of hyperspherical energy (HE) on top of CKA kernels to address this drawback and improve training stability. Additionally, by leveraging CKA-based feature kernels, we derive feature repulsive terms applied to synthetically generated outlier examples. Experiments on both diverse ensembles and hypernetworks show that our approach significantly outperforms baselines in terms of uncertainty quantification in both synthetic and realistic outlier detection tasks.
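
To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a linear feature kernel and a Euclidean (chord) distance inside the energy; the paper may use other kernels or a geodesic distance. The function names `centered_unit_gram`, `linear_cka`, and `hyperspherical_energy` are hypothetical and chosen only for illustration.

```python
import numpy as np

def centered_unit_gram(features):
    """Linear Gram matrix of a (batch, dim) feature array, double-centered,
    flattened, and scaled to unit norm (a point on the unit hypersphere)."""
    n = features.shape[0]
    gram = features @ features.T                # assumption: linear kernel
    h = np.eye(n) - np.ones((n, n)) / n         # centering matrix H
    flat = (h @ gram @ h).ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)

def linear_cka(feats_a, feats_b):
    """CKA as the cosine similarity between centered, normalized Gram matrices."""
    return float(centered_unit_gram(feats_a) @ centered_unit_gram(feats_b))

def hyperspherical_energy(unit_vectors, s=2.0, eps=1e-6):
    """Mean Riesz s-energy over all pairs: 1 / distance**s.
    Assumption: Euclidean distance between points on the sphere."""
    v = np.asarray(unit_vectors)
    dists = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    upper = np.triu_indices(len(v), k=1)        # each unordered pair once
    return float(np.mean(1.0 / (dists[upper] + eps) ** s))
```

Because the Gram matrix depends only on inner products between examples, permuting the feature columns (for instance, reordering hidden units) leaves `linear_cka` unchanged, which is the permutation invariance the abstract asks of a network similarity metric. Minimizing `hyperspherical_energy` over the per-model points pushes them apart on the sphere.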

Paper Structure

This paper contains 26 sections, 1 theorem, 14 equations, 19 figures, 8 tables.

Key Result

Lemma A.1

For $\mathcal{J}(\phi) = \mathbb{E}_{x\sim q}[\log q_{\phi}(x)]$,

Figures (19)

  • Figure 1: Overview of feature repulsive loss construction: Starting with a batch of examples (left), optionally including synthetic outliers, ensemble features at each layer $l$ are used to construct centered Gram matrices projected onto the unit hypersphere (middle). The hyperspherical energy is then calculated between models, weighted by layer, and incorporated into the loss function (right).
  • Figure 2: Comparison between optimizing cosine similarity ($\mathop{\mathrm{cossim}}\nolimits$) and $\mathrm{HE}$ on a sphere. (a) The initial random set of points placed on the sphere. (b-c) The final set of points after 50 iterations of minimizing either $\mathop{\mathrm{cossim}}\nolimits$ or $\mathrm{HE}$ as the similarity metric. (d-e) The value of $\mathop{\mathrm{cossim}}\nolimits$/$\mathrm{HE}$ with respect to the number of iterations. The orange line indicates that $\mathop{\mathrm{cossim}}\nolimits$ is minimized and the black line indicates that $\mathrm{HE}$ with $s=2$ is minimized. Both methods used gradient descent with a learning rate of $0.75$ and momentum $0.9$.
  • Figure 3: Predictive entropies (PE) on a four-cluster 2D classification task. Darker values indicate higher-entropy, lower-confidence regions, and lighter values indicate higher-confidence regions. (b) and (d) use an RBF kernel on ensemble member weights, whereas (c) and (e) use an RBF kernel on ensemble member outputs. (f) and (g) use the $\mathop{\mathrm{HE-CKA}}\nolimits$, RBF feature kernel, for feature diversity on inlier points. Both (h) and (j) use $\mathop{\mathrm{HE-CKA}}\nolimits$ and OOD entropy terms. All methods were trained on an ensemble of 30 four-layer MLPs for 1k iterations with the same seeds.
  • Figure 4: Hypernetwork $h(z)$ model architecture example on a four-layer CNN
  • Figure 5: 1D regression task comparing uncertainty estimation between different approaches
  • ...and 14 more figures
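
The contrast drawn in Figure 2 can be illustrated on a single pair of points. The sketch below is a hedged toy check, not the paper's experiment: it places two unit vectors at a separating angle `theta` and estimates, by central finite differences, how strongly each objective responds to a change in that angle. It assumes a geodesic (arc-length) distance inside the energy with $s=2$; the helper names are hypothetical.

```python
import numpy as np

def pair_at_angle(theta):
    """Two unit vectors in the plane separated by angle theta (radians)."""
    return np.array([1.0, 0.0]), np.array([np.cos(theta), np.sin(theta)])

def cossim(u, v):
    """Cosine similarity of two unit vectors."""
    return float(u @ v)

def geodesic_energy(u, v, s=2.0):
    """Riesz s-energy of one pair. Assumption: geodesic (arc-length) distance.
    The clip guards arccos against rounding slightly outside [-1, 1]."""
    angle = np.arccos(np.clip(u @ v, -1.0, 1.0))
    return float(angle ** (-s))

def slope(objective, theta, step=1e-6):
    """Central finite-difference derivative with respect to the separating angle."""
    hi = objective(*pair_at_angle(theta + step))
    lo = objective(*pair_at_angle(theta - step))
    return (hi - lo) / (2.0 * step)

# Print the gradient magnitude of each objective as the pair gets closer.
for theta in (1.0, 0.1, 0.01):
    print(theta, abs(slope(cossim, theta)), abs(slope(geodesic_energy, theta)))
```

Analytically, the cosine-similarity slope is $-\sin\theta$, which vanishes as the two points coincide, while the energy slope is $-s\,\theta^{-(s+1)}$, which grows. Under these assumptions, the energy keeps a strong repulsive signal in exactly the regime where cosine similarity gives almost none.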

Theorems & Definitions (2)

  • Lemma A.1
  • proof