Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference

Yarin Gal, Zoubin Ghahramani

TL;DR

The paper tackles CNN overfitting in data-scarce regimes by formulating a Bayesian CNN that places a Bernoulli variational distribution over the network's kernels, implemented via dropout. By interpreting dropout as approximate variational inference and evaluating with Monte Carlo (MC) dropout, it obtains predictive uncertainty estimates and robust regularization without adding model parameters. Empirical results on MNIST and CIFAR-10 show improved generalization and, for some architectures, improvements on published state-of-the-art CIFAR-10 results; the paper also examines convergence and the test-time cost of MC dropout. The work connects dropout to Gaussian processes and provides a practical, low-overhead approach to Bayesian CNNs, with guidance on when MC dropout is beneficial compared with standard dropout.

Abstract

Convolutional neural networks (CNNs) work well on large datasets. But labelled data is hard to collect, and in some applications larger amounts of data are not available. The problem then is how to use CNNs with small data -- as CNNs overfit quickly. We present an efficient Bayesian CNN, offering better robustness to over-fitting on small data than traditional approaches. This is by placing a probability distribution over the CNN's kernels. We approximate our model's intractable posterior with Bernoulli variational distributions, requiring no additional model parameters. On the theoretical side, we cast dropout network training as approximate inference in Bayesian neural networks. This allows us to implement our model using existing tools in deep learning with no increase in time complexity, while highlighting a negative result in the field. We show a considerable improvement in classification accuracy compared to standard techniques and improve on published state-of-the-art results for CIFAR-10.
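The Monte Carlo estimate behind MC dropout can be written compactly. The notation here is an assumption made for this summary rather than a quotation from the paper: \omega denotes all kernels and weights, q(\omega) the Bernoulli variational distribution induced by dropout, (x^*, y^*) a test input and its label, and T the number of stochastic forward passes.

```latex
p(y^* \mid x^*, \mathcal{D})
  \approx \int p(y^* \mid x^*, \omega)\, q(\omega)\, \mathrm{d}\omega
  \approx \frac{1}{T} \sum_{t=1}^{T} p(y^* \mid x^*, \hat{\omega}_t),
  \qquad \hat{\omega}_t \sim q(\omega)
```

Each \hat{\omega}_t corresponds to one forward pass with freshly sampled dropout masks, so the estimate needs no parameters beyond those of the dropout network itself.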

Paper Structure

This paper contains 17 sections, 11 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Test error for LeNet with dropout applied after every weight layer (lenet-all -- our Bayesian CNN implementation, blue), dropout applied after the fully connected layer alone (lenet-ip, green), and without dropout (lenet-none, dotted red line). Standard dropout is shown with a dashed line, MC dropout is shown with a solid line. Note that although Standard dropout lenet-all performs very badly on both datasets (dashed blue line), when evaluating the same network with MC dropout (solid blue line) the model outperforms all others.
  • Figure 2: Test error of LeNet trained on random subsets of MNIST decreasing in size. To the left in green are networks with dropout applied after the last layer alone (lenet-ip) and evaluated with Standard dropout (the standard approach in the field); to the right in blue are networks with dropout applied after every weight layer (lenet-all) and evaluated with MC dropout -- our Bayesian CNN implementation. Note how lenet-ip starts over-fitting even with a quarter of the dataset. With a small enough dataset, both models over-fit. MC dropout was used with 10 samples.
  • Figure 3: Augmented-DSN test error for different numbers of averaged forward passes in MC dropout (blue), averaged over 5 repetitions and shown with 1 standard deviation. In green is test error with Standard dropout. MC dropout achieves a significant improvement (more than 1 standard deviation) after 20 samples.
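
The two evaluation modes compared in the figures above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-layer fully connected network, the weights, the dropout placement, and the names forward and mc_dropout_predict are all hypothetical, and inverted-dropout scaling is assumed so that a pass without masks plays the role of Standard dropout evaluation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights):
    """Matrix-vector product; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def forward(x, w1, w2, p_drop, rng=None):
    """One forward pass through a toy two-layer network (hypothetical).

    With rng given, Bernoulli masks are sampled, giving a stochastic pass.
    With rng=None nothing is dropped, which stands in for Standard dropout
    evaluation under the assumed inverted-dropout scaling.
    """
    def dropout(v):
        if rng is None:
            return v
        keep = 1.0 - p_drop
        # Keep each unit with probability `keep` and rescale the survivors.
        return [vi / keep if rng.random() < keep else 0.0 for vi in v]

    hidden = [max(0.0, a) for a in dense(dropout(x), w1)]  # ReLU layer
    return softmax(dense(dropout(hidden), w2))


def mc_dropout_predict(x, w1, w2, p_drop, n_samples, seed=0):
    """Average class probabilities over n_samples stochastic passes."""
    rng = random.Random(seed)
    mean = [0.0] * len(w2)
    for _ in range(n_samples):
        probs = forward(x, w1, w2, p_drop, rng)
        mean = [m + p / n_samples for m, p in zip(mean, probs)]
    return mean


# Illustrative weights and input; the values carry no meaning.
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.6, 0.4, 0.9], [0.2, -0.7, 0.3]]
w2 = [[0.4, -0.3, 0.6, 0.1], [-0.5, 0.7, 0.2, -0.4]]
x = [1.0, 0.5, -1.5]

standard = forward(x, w1, w2, p_drop=0.5)                      # one deterministic pass
mc = mc_dropout_predict(x, w1, w2, p_drop=0.5, n_samples=20)   # 20 stochastic passes
```

MC dropout costs n_samples forward passes per input instead of one, which is the test-time trade-off that Figure 3 examines.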