A Bayesian encourages dropout

Shin-ichi Maeda

TL;DR

This work reframes dropout as Bayesian model averaging over submodels and demonstrates how optimizing the dropout rate improves both parameter learning and predictive quality. By formulating a variational lower bound on the marginal likelihood and employing trial distributions for masks, the authors derive a principled procedure for adjusting dropout rates, including uniform and feature-wise variants. Empirical results on a binary classification task show that per-feature dropout can approach Bayes-optimal performance, especially when only a small subset of features is informative, while uniform-rate dropout may underperform in such settings. The approach provides a scalable Bayesian rationale for dropout, linking training-time regularization to improved predictive distributions and offering practical algorithms for rate optimization.

Abstract

Dropout is one of the key techniques for preventing overfitting during learning. It has been explained that dropout works as a kind of modified L2 regularization. Here, we shed light on dropout from a Bayesian standpoint. The Bayesian interpretation enables us to optimize the dropout rate, which benefits both the learning of weight parameters and prediction after learning. The experimental results also encourage optimizing the dropout rate.
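
The variational lower bound mentioned in the TL;DR can be illustrated with a generic sketch. The notation here is an assumption for illustration and is not taken from the paper: $D$ is the training data, $w$ the weights, $m \in \{0,1\}^d$ a binary dropout mask, $p(m)$ a prior over masks, and $q(m)$ a trial distribution over masks parameterized by per-feature rates $\lambda_i$. Under those assumptions, Jensen's inequality gives a bound of roughly this form:

```latex
\begin{align}
\log p(D \mid w)
  &= \log \sum_{m} p(m)\, p(D \mid w, m) \\
  &\ge \sum_{m} q(m) \log \frac{p(m)\, p(D \mid w, m)}{q(m)} \\
  &= \mathbb{E}_{q(m)}\!\left[ \log p(D \mid w, m) \right]
     - \mathrm{KL}\!\left( q(m) \,\|\, p(m) \right),
\qquad
q(m) = \prod_{i=1}^{d} \lambda_i^{m_i} (1 - \lambda_i)^{1 - m_i}.
\end{align}
```

In this reading, training with sampled masks corresponds to a stochastic estimate of the expectation term, and adjusting the rates $\lambda_i$ (one shared value for the uniform variant, one per feature for the feature-wise variant) tightens the bound. The paper's own derivation may differ in its choice of prior and parameterization.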

Paper Structure

This paper contains 16 sections, 15 equations, 1 figure.

Figures (1)

  • Figure 1: Experimental results. (a) Test accuracy. (b) Dropout rate after learning.