Mean-Field Assisted Deep Boltzmann Learning with Probabilistic Computers

Shuvro Chowdhury; Shaila Niazi; Kerem Y. Camsari

TL;DR

The paper tackles the tractability challenge of training deep, unrestricted Boltzmann machines (BMs). A fast FPGA-based p-computer performs negative-phase sampling by Gibbs sampling, while two mean-field theory variants (naive and hierarchical, collectively xMFTs) estimate the positive-phase statistics on a classical computer. This hybrid framework enables CD-$n$ with very large $n$ on large, sparse Ising networks. Empirically, the authors train a 2-layer DBM on a Pegasus graph with 2560 p-bits to about 87% accuracy on full MNIST when Gibbs sampling is used, compared with about 70% when naive MFT is used. On the smaller MNIST/100 set, naive MFT, hierarchical MFT (HMFT), and Gibbs sampling reach similar training accuracy, and HMFT gives better positive-phase correlation estimates than naive MFT. The work suggests that, with dedicated probabilistic hardware and xMFT-assisted training, deep and unrestricted BMs can be trained at scales previously deemed intractable, and that the approach may carry over to other Ising-machine platforms.

Abstract

Despite their appeal as physics-inspired, energy-based, and generative models, general Boltzmann Machines (BMs) are considered intractable to train. This belief led to simplified models of BMs with restricted intralayer connections or to layer-by-layer training of deep BMs. Recent developments in domain-specific hardware, specifically probabilistic computers (p-computers) built from probabilistic bits (p-bits), may change established wisdom on the tractability of deep BMs. In this paper, we show that deep and unrestricted BMs can be trained using p-computers generating hundreds of billions of Markov Chain Monte Carlo (MCMC) samples per second, on sparse networks developed originally for use in D-Wave's annealers. To maximize the efficiency of learning with the p-computer, we introduce two families of Mean-Field Theory assisted learning algorithms, or xMFTs (x = Naive and Hierarchical). The xMFTs are used to estimate the averages and correlations during the positive phase of the contrastive divergence (CD) algorithm, and our custom-designed p-computer is used to estimate the averages and correlations in the negative phase. A custom Field-Programmable Gate Array (FPGA) emulation of the p-computer architecture performs up to 45 billion flips per second, allowing the implementation of CD-$n$ where $n$ can be of the order of millions, unlike RBMs where $n$ is typically 1 or 2. Experiments on the full MNIST dataset with the combined algorithm show that the positive phase can be efficiently computed by xMFTs without much degradation when the negative phase is computed by the p-computer. Our algorithm can be used in other scalable Ising machines, and its variants can be used to train BMs previously thought to be intractable.
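The division of labor described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy 6-spin network, the learning rate, the sweep count, the persistent sampling chain, and the choice of inverse temperature 1 are all assumptions made for illustration. Only the naive MFT variant is shown, and the negative phase runs sequential Gibbs updates in software rather than on an FPGA p-computer.

```python
import numpy as np

def naive_mft_positive(J, h, batch, vis, hid, tol=1e-2, max_iter=100):
    """Positive phase: clamp visibles to data, solve m_hid = tanh(I_hid)."""
    n = len(h)
    corr, mean = np.zeros((n, n)), np.zeros(n)
    for v in batch:
        m = np.zeros(n)
        m[vis] = v                                  # clamp visible spins
        for _ in range(max_iter):                   # fixed-point iteration
            new_hid = np.tanh(J[hid] @ m + h[hid])
            change = np.linalg.norm(new_hid - m[hid])
            scale = np.linalg.norm(new_hid) + 1e-12
            m[hid] = new_hid
            if change / scale < tol:                # relative tolerance
                break
        corr += np.outer(m, m)    # naive MFT: <m_i m_j> ~ <m_i><m_j>
        mean += m
    return corr / len(batch), mean / len(batch)

def gibbs_negative(J, h, m, n_sweeps, rng):
    """Negative phase: free-running Gibbs sampling over all spins."""
    n = len(h)
    corr, mean = np.zeros((n, n)), np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(n):                          # sequential updates
            r = rng.uniform(-1.0, 1.0)
            m[i] = 1.0 if np.tanh(J[i] @ m + h[i]) > r else -1.0
        corr += np.outer(m, m)
        mean += m
    return corr / n_sweeps, mean / n_sweeps

def train_step(J, h, mask, batch, vis, hid, m_chain, rng, lr=0.05, n_sweeps=200):
    """One CD-style update: positive (MFT) minus negative (Gibbs) statistics."""
    pos_corr, pos_mean = naive_mft_positive(J, h, batch, vis, hid)
    neg_corr, neg_mean = gibbs_negative(J, h, m_chain, n_sweeps, rng)
    J += lr * mask * (pos_corr - neg_corr)          # only existing edges learn
    h += lr * (pos_mean - neg_mean)
    return J, h

# Toy setup (hypothetical): 4 visible spins, 2 hidden spins, random sparse graph.
rng = np.random.default_rng(0)
n, vis, hid = 6, np.arange(4), np.arange(4, 6)
mask = np.triu(rng.random((n, n)) < 0.6, k=1).astype(float)
mask = mask + mask.T                                # symmetric, zero diagonal
J, h = np.zeros((n, n)), np.zeros(n)
batch = np.array([[1, 1, -1, -1], [-1, -1, 1, 1]], dtype=float)
m_chain = rng.choice([-1.0, 1.0], size=n)
for _ in range(20):
    J, h = train_step(J, h, mask, batch, vis, hid, m_chain, rng)
```

The sparsity mask keeps the weight matrix restricted to the edges of the chosen graph, which is the role the Pegasus topology plays in the paper. In this sketch the sweep count `n_sweeps` corresponds to the $n$ in CD-$n$.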

Paper Structure (9 sections, 6 equations, 4 figures, 2 tables, 2 algorithms)

Introduction
Gibbs Sampling with p-bits and Mean Field Theories
Hierarchical Mean Field Assisted CD Algorithm
Experiments
Conclusions and Outlook
HMFT vs NMFT in a toy 2-spin example
Contrastive divergence algorithm
Evolution of correlations over epochs
Log-likelihood measure for xMFT algorithms

Figures (4)

Figure 1: p-computing overview: (a) Analogy between interacting bodies in nature and interacting p-bit networks we build in this work. In stochastic MTJ (sMTJ) based implementations of p-bits, a low energy barrier magnet is used to generate natural noise. (b) Typical output of a p-bit against time fluctuating randomly between $+1$ and $-1$. (c) Input/output characteristic of a p-bit. The output (blue curve) is pinned to $\pm 1$ at strong positive and negative inputs. The average (orange) has a tanh behavior. (d) In this work, we emulate the p-bit in a digital system (FPGA) with a pseudorandom number generator (PRNG), a lookup table for the tanh and a comparator. (e) The digital emulation of the synapse with MUXes is also shown. (f) A p-computer consisting of a network of such p-bits is then realized in an FPGA.
Figure 2: Hybrid computing scheme for ML: A hybrid computing scheme combining probabilistic and classical computers is shown. Inside the classical computer, the positive phase is computed with algorithms derived from mean-field theory. At the beginning of the negative phase, the classical computer sends the required weights and biases to our probabilistic computer (PC), where we perform Gibbs sampling. The FPGA-based PC generates a measured 45 billion Gibbs flips per second. The PC returns samples to the CPU, which computes the gradient. This process is repeated until convergence.
Figure 3: MNIST accuracy with different methods: (a) A sparse DBM (Pegasus, 2560 p-bits) is trained on full MNIST (60,000 images) with Gibbs sampling (CD-$10^{5}$) and with naive MFT, using batch size = 50, learning rate = 0.003, and momentum = 0.6. Around 87% accuracy is achieved in 100 epochs with Gibbs sampling and 70% with naive MFT. Test accuracy is measured on all 10,000 images of the MNIST test set, while training accuracy is measured on 10,000 images randomly drawn from the training set. (b) Training accuracy on MNIST/100 with the three different schemes (naive MFT, HMFT, and Gibbs sampling), which perform similarly. Here the batch size = 10, momentum = 0.6, and the learning rate varies from 0.06 to 0.006 over 1000 epochs.
Figure S1: Typical predictions of correlations from different methods: The correlations predicted by the three different schemes - naive MFT, HMFT, and Gibbs sampling (the putative exact method since we cannot obtain exact Boltzmann correlations in general spin-glasses) are shown during a typical epoch in the training of a sparse DBM. We used a batch size of 10 images and for the positive phase, we obtained correlations by showing only one batch of 10 images. For Gibbs sampling, we used $10^4$ sweeps in both positive and negative phases. We chose a relative error tolerance of $10^{-2}$ for both MFT and HMFT. 20 bins were used in both histograms. MFT algorithms do significantly better in the positive phase than in the negative phase allowing their use in the positive phase training of deep and unrestricted BMs, instead of the more expensive Gibbs sampling.
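The digital p-bit described in the Figure 1 caption (a pseudorandom number generator, a tanh, and a comparator) can be sketched in a few lines of NumPy. This is an illustrative software model under stated assumptions, not the FPGA design: the function name `pbit_sample` and the inverse-temperature parameter `beta` are hypothetical, the tanh is computed directly rather than read from a lookup table, and the update rule $m = \mathrm{sgn}(\tanh(\beta I) - r)$ with $r$ uniform on $[-1, 1]$ is assumed from the caption's description.

```python
import numpy as np

def pbit_sample(inputs, rng, beta=1.0):
    """Emulated p-bit: compare tanh(beta * I) against a uniform random number.

    Returns +1 when tanh(beta * I) exceeds r ~ U(-1, 1), otherwise -1.
    """
    inputs = np.asarray(inputs, dtype=float)
    r = rng.uniform(-1.0, 1.0, size=inputs.shape)   # stands in for the PRNG
    return np.where(np.tanh(beta * inputs) > r, 1, -1)

rng = np.random.default_rng(0)
inputs = np.linspace(-3.0, 3.0, 7)

# Drive each input value 20,000 times and average the binary outputs.
samples = pbit_sample(np.tile(inputs, (20000, 1)), rng)
mean_output = samples.mean(axis=0)
```

Each individual output is $\pm 1$, while `mean_output` approaches `np.tanh(inputs)` as the number of samples grows, matching the pinned output and tanh-shaped average described for Figure 1(c).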
