A Bayesian encourages dropout
Shin-ichi Maeda
TL;DR
This work reframes dropout as Bayesian model averaging over submodels and demonstrates how optimizing the dropout rate improves both parameter learning and predictive quality. By formulating a variational lower bound on the marginal likelihood and employing trial distributions for masks, the author derives a principled procedure for adjusting dropout rates, including uniform and feature-wise variants. Empirical results on a binary classification task show that per-feature dropout can approach Bayes-optimal performance, especially when only a small subset of features is informative, while uniform-rate dropout may underperform in such settings. The approach provides a scalable Bayesian rationale for dropout, linking training-time regularization to improved predictive distributions and offering practical algorithms for rate optimization.
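As a rough sketch of the kind of bound the summary refers to, the generic Jensen-inequality form is written out here. The notation is an assumption made for illustration and may differ from the paper's: $D$ is the training data, $\theta$ the weights, $m$ a binary dropout mask, $\lambda$ the dropout-rate parameter, and $q(m)$ the trial distribution over masks.

```latex
\log p(D \mid \theta, \lambda)
  = \log \sum_{m} p(D \mid m, \theta)\, p(m \mid \lambda)
  \;\ge\; \sum_{m} q(m) \log \frac{p(D \mid m, \theta)\, p(m \mid \lambda)}{q(m)}
```

The bound is tight when $q(m)$ equals the posterior over masks, $p(m \mid D, \theta, \lambda)$. Maximizing the right-hand side with respect to $\lambda$ as well as $\theta$ is what makes the dropout rate a learnable quantity under this reading.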
Abstract
Dropout is one of the key techniques for preventing overfitting during learning. It has been explained as a kind of modified L2 regularization. Here, we shed light on dropout from a Bayesian standpoint. The Bayesian interpretation enables us to optimize the dropout rate, which benefits both the learning of the weight parameters and prediction after learning. The experimental results also encourage optimizing the dropout rate.
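The model-averaging reading of dropout can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a logistic-regression model with fixed, made-up weights, and it parameterizes masks by a keep probability (one minus the dropout rate). The function name `mask_averaged_predict` and all numeric values are hypothetical. Each sampled mask selects one submodel, and the prediction is the average over submodels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mask_averaged_predict(x, w, keep_prob, n_samples=1000, seed=0):
    """Monte Carlo estimate of p(y=1 | x), averaged over dropout masks.

    keep_prob may be a scalar (uniform rate) or one value per feature.
    Illustrative only; names and model are assumptions, not the paper's.
    """
    rng = np.random.default_rng(seed)
    keep_prob = np.broadcast_to(keep_prob, w.shape)
    # Each row is one binary mask m, i.e. one submodel.
    masks = rng.random((n_samples, w.size)) < keep_prob
    # Drop the masked-out input features, then apply the shared weights.
    logits = (masks * x) @ w
    return sigmoid(logits).mean()

# Hypothetical input and weights, chosen only for illustration.
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.8, 0.1, -0.3])

p_uniform = mask_averaged_predict(x, w, keep_prob=0.5)
p_feature = mask_averaged_predict(x, w, keep_prob=np.array([0.9, 0.1, 0.5]))
p_full = mask_averaged_predict(x, w, keep_prob=1.0)  # no dropout
```

With `keep_prob=1.0` every mask keeps all features, so the average reduces to the ordinary prediction `sigmoid(x @ w)`. Passing a vector gives the feature-wise variant, in which features believed to be uninformative can be dropped more often than informative ones.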
