Variational Dropout Sparsifies Deep Neural Networks
Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov
TL;DR
The paper addresses overparameterization in deep neural networks with Sparse Variational Dropout, which extends Variational Dropout to individual, unbounded dropout rates per weight. A new approximation of the KL divergence term, combined with Additive Noise Reparameterization to reduce the variance of the gradient estimator, makes training with such rates practical. The resulting effect resembles automatic relevance determination (ARD): unnecessary weights are pruned while accuracy is preserved, cutting parameter counts by up to 280× on LeNet architectures and up to 68× on VGG-like networks, with substantial compression on CIFAR-10/100 at minimal loss of accuracy. The authors give formulations for fully-connected and convolutional layers, analyze the variance reduction, and show that the method resists memorization when trained on randomly labeled data. The approach offers a principled Bayesian route to sparsity and model compression in deep networks.
Abstract
We explore a recently proposed Variational Dropout technique that provided an elegant Bayesian interpretation of Gaussian Dropout. We extend Variational Dropout to the case when dropout rates are unbounded, propose a way to reduce the variance of the gradient estimator, and report the first experimental results with individual dropout rates per weight. Interestingly, it leads to extremely sparse solutions in both fully-connected and convolutional layers. This effect is similar to the automatic relevance determination effect in empirical Bayes but has a number of advantages. We reduce the number of parameters by up to 280 times on LeNet architectures and by up to 68 times on VGG-like networks with a negligible decrease in accuracy.
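
To make the ingredients named above more concrete, here is a minimal NumPy sketch, not the authors' implementation. It illustrates the additive noise parameterization (w = θ + σ·ε, with per-weight dropout rate α = σ²/θ²), the sigmoid-based approximation of the negative KL term with the constants reported in the paper, and pruning by a threshold on log α. The function names, the log σ² parameterization, the `eps` stabilizer, and the toy values are assumptions made for this example.

```python
import numpy as np

# Constants of the KL approximation as reported in the paper.
K1, K2, K3 = 0.63576, 1.87320, 1.48695


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def log_alpha(theta, log_sigma2, eps=1e-8):
    # Per-weight dropout rate alpha = sigma^2 / theta^2, kept in log space.
    # eps is a hypothetical stabilizer for weights that are exactly zero.
    return log_sigma2 - np.log(theta ** 2 + eps)


def neg_kl(la):
    # Approximate negative KL divergence per weight as a function of log(alpha).
    return K1 * sigmoid(K2 + K3 * la) - 0.5 * np.log1p(np.exp(-la)) - K1


def sample_weights(theta, log_sigma2, rng):
    # Additive noise reparameterization: w = theta + sigma * eps, eps ~ N(0, 1).
    # sigma is a free parameter, so the noise term does not depend on theta.
    sigma = np.exp(0.5 * log_sigma2)
    return theta + sigma * rng.standard_normal(theta.shape)


def keep_mask(theta, log_sigma2, threshold=3.0):
    # Weights with log(alpha) above the threshold are treated as pruned.
    return log_alpha(theta, log_sigma2) < threshold


# Toy example: one confident weight and one weight dominated by noise.
theta = np.array([1.0, 1e-3])
log_sigma2 = np.array([-10.0, -2.0])
rng = np.random.default_rng(0)

la = log_alpha(theta, log_sigma2)
w = sample_weights(theta, log_sigma2, rng)
mask = keep_mask(theta, log_sigma2)
```

In this toy setting the first weight has a small log α and is kept, while the second has a large log α and is masked out. The KL approximation approaches zero as α grows, so the regularizer alone favors large dropout rates, and weights that the data term does not need are driven toward removal. The threshold of 3 on log α follows the value used in the paper's experiments.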
