Variational Dropout Sparsifies Deep Neural Networks

Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov

TL;DR

The paper addresses overparameterization in deep neural networks by introducing Sparse Variational Dropout, which extends Variational Dropout to per-weight, unbounded dropout rates and uses a new KL divergence approximation together with Additive Noise Reparameterization to achieve extreme sparsity. This ARD-like mechanism prunes unnecessary weights while preserving accuracy, enabling massive parameter reduction (up to 280× on LeNet and 68× on VGG-like models) and substantial compression on CIFAR-10/100 with minimal performance loss. The authors provide detailed FC and convolutional layer formulations, analyze variance reduction benefits, and demonstrate robustness against memorization in random-label settings. Overall, the approach offers a principled Bayesian pathway to scalable sparsity and model compression in deep networks.

Abstract

We explore a recently proposed Variational Dropout technique that provided an elegant Bayesian interpretation to Gaussian Dropout. We extend Variational Dropout to the case when dropout rates are unbounded, propose a way to reduce the variance of the gradient estimator and report first experimental results with individual dropout rates per weight. Interestingly, it leads to extremely sparse solutions both in fully-connected and convolutional layers. This effect is similar to automatic relevance determination effect in empirical Bayes but has a number of advantages. We reduce the number of parameters up to 280 times on LeNet architectures and up to 68 times on VGG-like networks with a negligible decrease of accuracy.

Paper Structure

This paper contains 19 sections, 20 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Different approximations of KL divergence: the blue and green ones (Kingma et al., 2015) are tight only for $\alpha \leq 1$; the black one is the true value, estimated by sampling; the red one is our approximation.
  • Figure 2: Original parameterization vs Additive Noise Reparameterization. Additive Noise Reparameterization leads to much faster convergence, a better value of the variational lower bound, and a higher sparsity level.
  • Figure 3: Accuracy and sparsity level for VGG-like architectures of different sizes. The number of neurons and filters scales as $k$. Dense networks were trained with Binary Dropout, and Sparse VD networks were trained with Sparse Variational Dropout on all layers. The overall sparsity level achieved by our method is shown as a dashed line. The accuracy drop is negligible in most cases, and the sparsity level is high, especially in larger networks.
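
The KL approximation in Figure 1 and the additive noise parameterization in Figure 2 can be illustrated with a short sketch. This is a minimal pure-Python version, not the authors' code. It assumes the sigmoid-based form $-D_{KL} \approx k_1 \sigma(k_2 + k_3 \log\alpha) - 0.5 \log(1 + \alpha^{-1}) - k_1$, with constants quoted from memory of the paper (check them against the original text), and the additive parameterization $\alpha = \sigma^2 / \theta^2$. The function names and the `eps` stabilizer are hypothetical choices made for this example.

```python
import math

# Constants as recalled from the paper's KL approximation (an assumption here;
# verify against the original text before relying on them).
K1, K2, K3 = 0.63576, 1.87320, 1.48695


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def neg_kl_approx(log_alpha):
    """Approximate -KL(q || p) for a single weight, given log(alpha).

    Uses 0.5 * log(1 + 1/alpha) = 0.5 * softplus(-log_alpha), computed stably.
    """
    x = -log_alpha
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return K1 * sigmoid(K2 + K3 * log_alpha) - 0.5 * softplus - K1


def log_alpha_additive(theta, log_sigma2, eps=1e-8):
    """log(alpha) when the weight is parameterized as w = theta + sigma * noise.

    alpha = sigma^2 / theta^2; eps (hypothetical) avoids log(0) for theta == 0.
    """
    return log_sigma2 - math.log(theta * theta + eps)


if __name__ == "__main__":
    for la in (-5.0, 0.0, 5.0, 20.0):
        print(f"log_alpha={la:6.1f}  -KL approx={neg_kl_approx(la):.6f}")
```

Under this form the approximation rises toward zero as $\log\alpha$ grows, so the regularizer rewards large per-weight dropout rates. That is consistent with the sparsification effect described in the abstract.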