Hamiltonian Monte Carlo on ReLU Neural Networks is Inefficient

Vu C. Dinh, Lam Si Tung Ho, Cuong V. Nguyen

TL;DR

Due to the non-differentiability of activation functions in the ReLU family, leapfrog HMC for networks with these activation functions has a large local error rate, which leads to a higher rejection rate of the proposals, making the method inefficient.

Abstract

We analyze the error rates of the Hamiltonian Monte Carlo algorithm with leapfrog integrator for Bayesian neural network inference. We show that due to the non-differentiability of activation functions in the ReLU family, leapfrog HMC for networks with these activation functions has a large local error rate of $\Omega(\epsilon)$ rather than the classical error rate of $\mathcal{O}(\epsilon^3)$. This leads to a higher rejection rate of the proposals, making the method inefficient. We then verify our theoretical findings through empirical simulations as well as experiments on a real-world dataset that highlight the inefficiency of HMC inference on ReLU-based neural networks compared to analytical networks.
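The gap between the two local error rates can be illustrated with a one-dimensional toy problem. The sketch below is not the paper's experiment: the potentials `U_smooth` (a quadratic) and `U_kink` (the piecewise-linear function `relu(q) + relu(-q)`), the unit-mass kinetic energy, and the chosen starting points are all assumptions made for illustration. It takes a single leapfrog step and compares the energy error at step size $\epsilon$ with the error at $\epsilon/2$; a ratio near $2^k$ suggests the error scales like $\epsilon^k$.

```python
def leapfrog_step(q, p, grad_U, eps):
    """One leapfrog step for H(q, p) = U(q) + p**2 / 2 (scalar q and p)."""
    p_half = p - 0.5 * eps * grad_U(q)           # half step in momentum
    q_new = q + eps * p_half                     # full step in position
    p_new = p_half - 0.5 * eps * grad_U(q_new)   # half step in momentum
    return q_new, p_new


def energy_error(U, grad_U, q, p, eps):
    """Absolute change in the Hamiltonian after a single leapfrog step."""
    q_new, p_new = leapfrog_step(q, p, grad_U, eps)
    return abs((U(q_new) + 0.5 * p_new**2) - (U(q) + 0.5 * p**2))


# Hypothetical smooth potential: U(q) = q**2 / 2.
def U_smooth(q):
    return 0.5 * q * q


def grad_smooth(q):
    return q


# Hypothetical ReLU-style potential: U(q) = relu(q) + relu(-q) = |q|,
# which is not differentiable at q = 0.
def U_kink(q):
    return abs(q)


def grad_kink(q):
    return 1.0 if q > 0 else (-1.0 if q < 0 else 0.0)


eps = 0.1

# Smooth case: fixed start (q, p) = (1, 1).
smooth_ratio = (energy_error(U_smooth, grad_smooth, 1.0, 1.0, eps)
                / energy_error(U_smooth, grad_smooth, 1.0, 1.0, eps / 2))

# Kinked case: start a quarter step to the left of the kink with p = 1,
# so that the single step crosses q = 0 for either step size.
kink_ratio = (energy_error(U_kink, grad_kink, -0.25 * eps, 1.0, eps)
              / energy_error(U_kink, grad_kink, -0.25 * eps / 2, 1.0, eps / 2))

print(f"smooth ratio: {smooth_ratio:.2f}  (near 8 indicates eps**3 scaling)")
print(f"kink ratio:   {kink_ratio:.2f}  (near 2 indicates eps scaling)")
```

In this toy setting the smooth potential gives a ratio close to 8 and the kinked potential a ratio close to 2, matching the cubic versus linear scaling described in the abstract. The kinked result only covers steps that cross the non-differentiable point; steps that stay on one linear piece do not show this behavior.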

Paper Structure

This paper contains 14 sections, 5 theorems, 49 equations, 6 figures, 2 tables.

Key Result

Theorem 3.1

If the derivatives of the potential energy function $U$ are well-defined up to the second order and are compatible with the chain rule, i.e., for all smooth functions $\phi$, then the leapfrog integrator is reversible and preserves volume. As a consequence, the HMC sampler with leapfrog integrator leaves the canonical distribution invariant.
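The two properties named in the theorem can be checked numerically on a small example. The following is a minimal sketch under stated assumptions, not the paper's proof or code: the smooth potential $U(q) = q^4/4 + q^2/2$, the unit-mass kinetic energy, and the values of `eps`, `n_steps`, `q0`, and `p0` are all hypothetical choices. Reversibility is tested by running the integrator forward, flipping the momentum, running it again, and flipping the momentum back. Volume preservation is tested by estimating the Jacobian determinant of the leapfrog map with central differences.

```python
def leapfrog(q, p, grad_U, eps, n_steps):
    """Run n_steps leapfrog steps for H(q, p) = U(q) + p**2 / 2 (scalar q, p)."""
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad_U(q)
        q = q + eps * p
        p = p - 0.5 * eps * grad_U(q)
    return q, p


def grad_U(q):
    # Gradient of the hypothetical smooth potential U(q) = q**4 / 4 + q**2 / 2.
    return q**3 + q


eps, n_steps = 0.05, 20
q0, p0 = 0.7, -0.3

# Reversibility: forward, flip momentum, forward again, flip momentum back.
q1, p1 = leapfrog(q0, p0, grad_U, eps, n_steps)
q_back, p_back = leapfrog(q1, -p1, grad_U, eps, n_steps)
p_back = -p_back
reversibility_gap = max(abs(q_back - q0), abs(p_back - p0))

# Volume preservation: the Jacobian of (q0, p0) -> (q1, p1) should have
# determinant 1. Each partial derivative is estimated by a central difference.
h = 1e-5
qa, pa = leapfrog(q0 + h, p0, grad_U, eps, n_steps)
qb, pb = leapfrog(q0 - h, p0, grad_U, eps, n_steps)
qc, pc = leapfrog(q0, p0 + h, grad_U, eps, n_steps)
qd, pd = leapfrog(q0, p0 - h, grad_U, eps, n_steps)

dq_dq0, dp_dq0 = (qa - qb) / (2 * h), (pa - pb) / (2 * h)
dq_dp0, dp_dp0 = (qc - qd) / (2 * h), (pc - pd) / (2 * h)
jacobian_det = dq_dq0 * dp_dp0 - dq_dp0 * dp_dq0

print(f"reversibility gap:    {reversibility_gap:.2e}")
print(f"Jacobian determinant: {jacobian_det:.8f}")
```

The reversibility gap should sit at floating-point round-off, and the determinant should agree with 1 up to the finite-difference error. Both checks concern the integrator's structure only; they say nothing about how large the energy error of a trajectory is.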

Figures (6)

  • Figure 1: Acceptance rates of HMC with respect to the number of leapfrog steps $L$ (left) and step size $\epsilon$ (right) on BNNs with different activation functions. The decay in acceptance rates of sigmoid networks is much more moderate than those of ReLU-based networks.
  • Figure 2: Efficiency of HMC with respect to acceptance rate on BNNs with different activation functions. On both synthetic and UTKFace datasets, HMC inference with sigmoid networks is more efficient than with ReLU-based networks.
  • Figure 3: Acceptance rate of HMC with respect to the number of model parameters on shallow and deep neural networks with different activation functions. HMC on shallow networks generally has lower acceptance rates than deep networks of the same size.
  • Figure 4: Efficiency as functions of acceptance probability for symplectic integrator of second order (left, error rate $\mathcal{O}(\epsilon^2)$) and first order (right, error rate $\Theta(\epsilon)$). The $y$-axes of the graphs are presented up to unknown multiplicative constants and cannot be directly compared.
  • Figure 5: Upper bound on efficiency as functions of acceptance probability for symplectic integrator of second order (left, error rate $\mathcal{O}(\epsilon^2)$) and first order (right, error rate $\Theta(\epsilon)$). The $y$-axes of the graphs are presented up to unknown multiplicative constants and cannot be directly compared.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Theorem 3.1
  • Lemma 3.2
  • Theorem 3.3
  • Lemma 3.4
  • Proposition 3.5