Diluting Restricted Boltzmann Machines

C. Díaz-Faloh, R. Mulet

TL;DR

The paper addresses the cost and scalability concerns of large neural networks by testing the Lottery Ticket Hypothesis on Restricted Boltzmann Machines under extreme sparsity. It trains RBMs on MNIST, applies heavy pruning before and after training, and evaluates generative quality with multiple metrics, including a novel auxiliary classifier score and generalized Ising-model mappings. Key findings show RBMs can sustain high-quality generation with up to $80\%$ initial pruning, but additional pruning causes abrupt degradation, and retraining cannot fully overcome the initial learning trajectory, highlighting the importance of early pruning and initialization. These results have practical implications for designing efficient sparse architectures and emphasize the enduring influence of initial conditions on network capabilities, with potential applicability beyond RBMs to broader sparse learning regimes.

Abstract

Recent advances in artificial intelligence have relied heavily on increasingly large neural networks, raising concerns about their computational and environmental costs. This paper investigates whether simpler, sparser networks can maintain strong performance by studying Restricted Boltzmann Machines (RBMs) under extreme pruning conditions. Inspired by the Lottery Ticket Hypothesis, we demonstrate that RBMs can achieve high-quality generative performance even when up to 80% of the connections are pruned before training, confirming that they contain viable sub-networks. However, our experiments reveal crucial limitations: trained networks cannot fully recover lost performance through retraining once additional pruning is applied. We identify a sharp transition above which the generative quality degrades abruptly, which occurs when pruning disrupts a minimal core of essential connections. Moreover, retrained networks remain constrained by the parameters originally learned, performing worse than networks trained from scratch at equivalent sparsity levels. These results suggest that for sparse networks to work effectively, pruning should be implemented early in training rather than attempted afterwards. Our findings provide practical insights for the development of efficient neural architectures and highlight the persistent influence of initial conditions on network capabilities.
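
As an illustration of what pruning before training can look like in practice, the following is a minimal sketch, not the authors' implementation. It assumes that connections are removed uniformly at random and that a fixed binary mask is reapplied after every weight update so pruned connections stay at zero. The layer sizes, learning rate, and the names `make_pruning_mask` and `masked_update` are hypothetical, and the gradient is a random placeholder rather than a real contrastive-divergence estimate.

```python
import numpy as np


def make_pruning_mask(n_visible, n_hidden, p0, rng):
    """Return a binary mask in which a fraction p0 of the entries is zero.

    Assumption: connections are pruned uniformly at random.
    """
    n_total = n_visible * n_hidden
    n_pruned = int(round(p0 * n_total))
    mask = np.ones(n_total, dtype=np.float64)
    pruned_idx = rng.choice(n_total, size=n_pruned, replace=False)
    mask[pruned_idx] = 0.0
    return mask.reshape(n_visible, n_hidden)


def masked_update(W, grad, mask, lr):
    """Apply one gradient-ascent step, then force pruned weights back to zero."""
    return (W + lr * grad) * mask


rng = np.random.default_rng(0)

# Hypothetical sizes: 784 visible units (28x28 MNIST pixels), 64 hidden units.
mask = make_pruning_mask(784, 64, p0=0.8, rng=rng)
W = rng.normal(0.0, 0.01, size=(784, 64)) * mask

# Placeholder gradient standing in for a real training signal.
grad = rng.normal(size=W.shape)
W = masked_update(W, grad, mask, lr=0.01)

sparsity = 1.0 - mask.mean()  # fraction of pruned connections
```

Keeping the mask fixed throughout training is what distinguishes this scheme from pruning a network that has already been trained, which is the second setting the abstract discusses.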

Paper Structure

This paper contains 10 sections, 12 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: (a) Examples generated by an RBM trained without any initial dilution. (b) Additional pruning of $p=20\%$ applied. (c) Additional pruning of $p=30\%$ applied. It can be observed that the generated images lose diversity as the additional dilution increases.
  • Figure 2: Generation quality $Q$ assigned by the additional classifier as a function of the initial pruning fraction $p_0$ used during training. Note that the $Q$ axis starts at $0.90$, so the apparent downward trend is less pronounced than it might initially seem.
  • Figure 3: (a) Generation quality of the RBMs given by the additional model. (b) The curves show the distance $d_f$ between the uniform distribution of the 10 MNIST digits and the frequency of generation of each digit in the samples generated by RBMs with initial dilution $p_0$. On the x-axis, the additional pruning applied to the replicas is represented, and the curves are truncated at the value of $p$ where the generation quality drops to zero. Since it is meaningless to distinguish digits in images that are no longer classifiable as such, truncation is necessary. (c) Adversarial error EAA. (d) Error of the second moment.
  • Figure 4: (a) Rescaling of the generation quality curves for different $p_0$. The new independent variable is $p^*=p-\alpha p_0$ with $\alpha=0.5$, which was the value that gave the best overlap between the curves. (b) The eigenvalues of networks with different dilution degrees are plotted in decreasing order, such that the largest eigenvalue has abscissa 1, the next 2, and so on. They are also compared with the Marchenko-Pastur (MP) distribution. All the curves corresponding to the networks are cut off at the same point around eigenvalue number 62.
  • Figure 5: The generation quality is plotted against the pruning for two networks: one trained under initial dilution $p_0$ and retrained under $p$, denoted $(p_0, p)$, and one trained from the start under initial dilution $p$, denoted $(p, 0)$. The plot is shown for four different cases of $p_0, p$, which allows a comparison between the two training schemes. The generation quality of the $(p_0, p)$ network always drops before that of the $(p, 0)$ network.
  • ...and 2 more figures
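
Two of the quantities named in the captions above can be sketched in a few lines. The rescaled variable $p^*=p-\alpha p_0$ with $\alpha=0.5$ is taken directly from the Figure 4 caption. The distance $d_f$ is only described in words in the Figure 3 caption, so the Euclidean norm used here is an assumption, as are the function names `rescaled_pruning` and `digit_frequency_distance` and the idea that the digit labels come from a classifier applied to the generated samples.

```python
import numpy as np


def rescaled_pruning(p, p0, alpha=0.5):
    """Rescaled pruning variable p* = p - alpha * p0 (Figure 4a)."""
    return np.asarray(p, dtype=float) - alpha * p0


def digit_frequency_distance(labels, n_classes=10):
    """Distance between observed digit frequencies and the uniform distribution.

    Assumption: Euclidean norm; the caption does not specify the metric.
    """
    counts = np.bincount(np.asarray(labels), minlength=n_classes)
    freqs = counts / counts.sum()
    uniform = np.full(n_classes, 1.0 / n_classes)
    return float(np.linalg.norm(freqs - uniform))


# Hypothetical inputs: additional pruning values and predicted digit labels.
p_star = rescaled_pruning([0.2, 0.3], p0=0.4)
d_balanced = digit_frequency_distance(np.arange(10))
d_collapsed = digit_frequency_distance(np.zeros(100, dtype=int))
```

Under this choice of metric, a sampler that produces every digit equally often gives a distance of zero, while one that collapses onto a single digit gives the largest possible value, which matches the loss of diversity described in the Figure 1 caption.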