Training Deep Boltzmann Networks with Sparse Ising Machines

Shaila Niazi; Navid Anjum Aadit; Masoud Mohseni; Shuvro Chowdhury; Yao Qin; Kerem Y. Camsari

TL;DR

The paper addresses the computational bottleneck of training deep Boltzmann networks by introducing a hybrid probabilistic-classical workflow built on sparse p-bit Ising machines implemented on FPGA hardware. By training hardware-aware sparse DBMs on full MNIST (and other datasets) with a graph-colored, massively parallel Gibbs sampler, the authors achieve ~90% MNIST accuracy with ~30k parameters (4,264 p-bits) and demonstrate image generation, a task at which a much larger RBM (about 3.25 million parameters) with the same accuracy fails. The sampler runs at a measured ≈50 to 64 flips/ns, which the authors report as at least an order of magnitude faster than comparable GPU/TPU implementations, and the study identifies index randomization and mixing time as critical to performance. The work supports the viability of Ising-machine-based training for deep generative models and points to possible further improvement in nanodevice-based implementations.

Abstract

The slowing down of Moore's law has driven the development of unconventional computing paradigms, such as specialized Ising machines tailored to solve combinatorial optimization problems. In this paper, we show a new application domain for probabilistic bit (p-bit) based Ising machines by training deep generative AI models with them. Using sparse, asynchronous, and massively parallel Ising machines, we train deep Boltzmann networks in a hybrid probabilistic-classical computing setup. We use the full MNIST and Fashion MNIST (FMNIST) datasets without any downsampling and a reduced version of the CIFAR-10 dataset in hardware-aware network topologies implemented in moderately sized Field Programmable Gate Arrays (FPGAs). For MNIST, our machine using only 4,264 nodes (p-bits) and about 30,000 parameters achieves the same classification accuracy (90%) as an optimized software-based Restricted Boltzmann Machine (RBM) with approximately 3.25 million parameters. Similar results follow for FMNIST and CIFAR-10. Additionally, the sparse deep Boltzmann network can generate new handwritten digits and fashion products, a task the 3.25 million parameter RBM fails at despite achieving the same accuracy. Our hybrid computer performs a measured 50 to 64 billion probabilistic flips per second, which is at least an order of magnitude faster than superficially similar Graphics and Tensor Processing Unit (GPU/TPU) based implementations. The massively parallel architecture can comfortably perform the contrastive divergence algorithm (CD-n) with up to n = 10 million sweeps per update, beyond the capabilities of existing software implementations. These results demonstrate the potential of using Ising machines for traditionally hard-to-train deep generative Boltzmann networks, with further possible improvement in nanodevice-based realizations.
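
The hybrid loop described in the abstract (a sampler produces states, a classical step updates weights and biases) can be sketched in software as below. This is a minimal NumPy illustration under stated assumptions, not the authors' FPGA implementation: the p-bit update rule m_i = sgn(tanh(β I_i) − r) with r uniform in (−1, 1), the ±1 state convention, the toy 6-node chain graph, and every function name and hyperparameter are assumptions chosen for illustration. The sampler here updates p-bits one at a time, whereas the machine in the paper updates them in parallel.

```python
import numpy as np

def gibbs_sweeps(J, h, m, free, n_sweeps, rng, beta=1.0):
    """Run n_sweeps of sequential Gibbs updates over the unclamped p-bits."""
    m = m.copy()
    for _ in range(n_sweeps):
        for i in free:
            local_field = J[i] @ m + h[i]          # input seen by p-bit i
            r = rng.uniform(-1.0, 1.0)
            m[i] = 1.0 if np.tanh(beta * local_field) > r else -1.0
    return m

def cd_update(J, h, mask, batch, visible, n_sweeps, lr, rng):
    """One contrastive-divergence style update (CD-n with n = n_sweeps)."""
    n = len(h)
    hidden = np.setdiff1d(np.arange(n), visible)
    corr_pos = np.zeros_like(J)
    corr_neg = np.zeros_like(J)
    mean_pos = np.zeros(n)
    mean_neg = np.zeros(n)
    for v in batch:
        m = rng.choice([-1.0, 1.0], size=n)
        m[visible] = v
        # Positive phase: visible p-bits clamped to the data, hidden p-bits free.
        m_pos = gibbs_sweeps(J, h, m, hidden, n_sweeps, rng)
        # Negative phase: every p-bit free, started from the positive-phase state.
        m_neg = gibbs_sweeps(J, h, m_pos, np.arange(n), n_sweeps, rng)
        corr_pos += np.outer(m_pos, m_pos)
        corr_neg += np.outer(m_neg, m_neg)
        mean_pos += m_pos
        mean_neg += m_neg
    k = len(batch)
    # The mask keeps the update on existing edges of the sparse graph only.
    J_new = J + lr * mask * (corr_pos - corr_neg) / k
    h_new = h + lr * (mean_pos - mean_neg) / k
    return J_new, h_new

# Toy setup (assumed): a 6-node chain, nodes 0..2 visible, nodes 3..5 hidden.
rng = np.random.default_rng(0)
n = 6
mask = np.zeros((n, n))
for i in range(n - 1):
    mask[i, i + 1] = mask[i + 1, i] = 1.0
J = np.zeros((n, n))
h = np.zeros(n)
visible = np.array([0, 1, 2])
batch = np.array([[1, 1, 1], [-1, -1, -1]], dtype=float)
for _ in range(20):
    J, h = cd_update(J, h, mask, batch, visible, n_sweeps=5, lr=0.05, rng=rng)
```

Multiplying the gradient by the adjacency mask is one simple way to keep the learned couplings on the hardware graph; because the mask and the correlation matrices are symmetric, J stays symmetric throughout.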

Paper Structure (32 sections, 15 equations, 19 figures, 4 tables, 1 algorithm)

Introduction
A Hybrid Probabilistic-Classical Computing Scheme
Hardware-aware Sparse Networks
Training Sparse DBMs with Sparse Ising Machines
Results on the Full MNIST dataset
Image generation
Mixing times
Randomization of indices
p-computer Architecture
FPGA and CPU specifications
MNIST data, D-Wave graphs and RBM code
Data transfer between FPGA and CPU
Measurement of flips per nanosecond
FPGA Implementation
p-bit and MAC Unit
...and 17 more sections

Figures (19)

Figure 1: (a) Hybrid computing scheme with a probabilistic computer (p-computer) and a classical computer implemented on a CPU. The p-computer generates samples according to the Boltzmann-Gibbs distribution and provides them to the CPU. The CPU then computes gradients, updates the weights (J) and biases (h), and sends them back to the p-computer; this loop repeats until convergence. (b) The p-computer illustrated here is based on a digital CMOS implementation (FPGA) with a measured sampling speed of $\approx 50\text{ to }64$ flips/ns. (c) Nanodevice-based p-computer: various analog implementations have been proposed \cite{chowdhury2023full}. (d) Hardware-aware sparse Deep Boltzmann Machines (DBMs) are represented with visible and hidden p-bits (examples of the Pegasus \cite{dattani2019pegasus} and Zephyr \cite{boothby2021zephyr} graphs are shown). (e) The sparse DBMs shown in (d) are illustrated with two layers of hidden units (Left), where both interlayer and intralayer (not shown) connections are allowed. See Supplementary Section \ref{sec:actual} for a full view of the networks used in this work; the graph density and vertex degree distribution of the sparse DBMs are shown in Supplementary Section \ref{sec:sparsity_dbm}. When a particular label p-bit corresponding to a digit is activated (clamping that label p-bit to 1 and the rest to 0), the network evolves to an image of that digit, as shown in the example (Right). (f) All 10 digits generated with the sparse DBM after training the network on the full MNIST dataset.
Figure 2: (a) MNIST accuracy vs. training epochs: with the sparse DBM, 90% accuracy is achieved in 100 epochs. Full MNIST (60,000 images) is trained on the sparse DBM (Pegasus, 4,264 p-bits) with CD-$10^{5}$, batch size = 50, learning rate = 0.003, momentum = 0.6, and 100 epochs; the total number of parameters is 30,404. Each epoch is one pass over all 60,000 training images, i.e. 1,200 weight updates. Test accuracy is measured on all 10,000 images of the MNIST test set, and training accuracy on 10,000 images chosen at random from the training set. (b) MNIST accuracy of a Restricted Boltzmann Machine (RBM) with 43 hidden units and CD-1 (CPU implementation), for a total of 34,142 parameters. This RBM stays below 90% accuracy, whereas the sparse DBM reaches 90% with approximately the same number of parameters. (c) MNIST accuracy of an RBM with 4,096 hidden units. Here the total number of parameters is 3,252,224 and the accuracy is 90% in 100 epochs, which the sparse DBM achieves with around $100\times$ fewer parameters. (d) MNIST test accuracy as a function of the number of parameters for sparse DBMs (Pegasus) and RBMs. We trained full MNIST on 5 different sizes of Pegasus graphs for 100 epochs using the same set of hyperparameters and recorded the accuracy on the whole test set. With only 6,464 parameters on a smaller Pegasus graph (960 p-bits), test accuracy did not exceed 50%. Accuracy increases on larger graphs with more parameters, and $\approx 90\%$ accuracy is achieved with the largest Pegasus graph (4,264 p-bits) that fits into our FPGA. The RBM reached 90% accuracy with around 200,000 parameters, but increasing the number of parameters further (up to 3.25 million) did not push accuracy beyond $\approx 92\%$.
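
The RBM parameter counts quoted in the Figure 2 caption are consistent with counting only the weights between 794 visible units (784 pixels plus 10 label units) and the hidden layer. That reading is an inference from the numbers, not something the caption states, and the helper name below is hypothetical. The same snippet checks the 1,200 updates per epoch and the "around 100×" parameter ratio.

```python
def rbm_weight_count(n_hidden, n_pixels=784, n_labels=10):
    """Weights in a fully connected bipartite RBM (biases excluded; assumed)."""
    return (n_pixels + n_labels) * n_hidden

small_rbm = rbm_weight_count(43)        # 34,142, as quoted for Fig. 2(b)
large_rbm = rbm_weight_count(4096)      # 3,252,224, as quoted for Fig. 2(c)

# Weight updates per epoch: training images divided by batch size.
updates_per_epoch = 60_000 // 50        # 1,200

# Parameter count quoted for the Pegasus sparse DBM with 4,264 p-bits.
sparse_dbm_params = 30_404
ratio = large_rbm / sparse_dbm_params   # roughly 107
```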
Figure 3: (a) Images generated with the sparse DBM by annealing the network from $\beta = 0$ to $\beta = 5$ in steps of 0.125 after training on the full MNIST dataset. The label p-bits for a particular digit are clamped to show how the visible p-bits evolve to that specific image. Examples of digits '0' and '7' are shown here. (b) The same image-generation procedure applied to the RBM (with 4,096 hidden units) that achieves 90% test accuracy. Using the same annealing schedule, the RBM does not produce the correct digits, unlike the sparse DBM. (c) Images of fashion products (e.g. 'Trouser' and 'Pullover') generated with the sparse DBM by annealing the network from $\beta = 0$ to $\beta = 5$ in steps of 0.125 after training on full Fashion MNIST. (d) The RBM with 4,096 hidden units cannot generate the correct images for the given labels despite achieving around 83% test accuracy.
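
The annealing schedule in the Figure 3 caption, β from 0 to 5 in steps of 0.125, contains 41 values. A minimal sketch of label-clamped annealed sampling follows. The p-bit update rule, the ±1 state convention (the caption's clamping "to 0" would correspond to −1 here), the single sweep per β value, and all names are assumptions for illustration; in practice J and h would come from a trained network rather than the two-node toy used here.

```python
import numpy as np

# Inverse-temperature schedule: 0, 0.125, ..., 5.0
betas = np.arange(0.0, 5.0 + 0.125, 0.125)

def anneal_generate(J, h, clamp_idx, clamp_val, betas, rng, sweeps_per_beta=1):
    """Anneal the free p-bits while the clamped ones stay fixed."""
    n = len(h)
    m = rng.choice([-1.0, 1.0], size=n)
    m[clamp_idx] = clamp_val
    free = np.setdiff1d(np.arange(n), clamp_idx)
    for beta in betas:
        for _ in range(sweeps_per_beta):
            for i in free:
                local_field = J[i] @ m + h[i]
                r = rng.uniform(-1.0, 1.0)
                m[i] = 1.0 if np.tanh(beta * local_field) > r else -1.0
    return m

# Toy check (assumed): two ferromagnetically coupled p-bits, node 0 clamped to +1.
rng = np.random.default_rng(0)
J = np.array([[0.0, 2.0], [2.0, 0.0]])
h = np.zeros(2)
m = anneal_generate(J, h, clamp_idx=np.array([0]), clamp_val=1.0,
                    betas=betas, rng=rng)
```

At β = 0 the free p-bit is a fair coin; by β = 5 it follows the clamped neighbour almost deterministically, which is the mechanism by which clamped label p-bits steer the visible p-bits toward an image.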
Figure 4: (a) Test accuracy after training on full MNIST (for only 40 epochs, to reduce computation) with different numbers of sweeps per iteration. For our sparse graph, mixing the Markov chain properly requires at least CD-$10^{4}$. Reducing the number of sweeps to $10^{3}$ or $10^{2}$ significantly degrades the mixing of the chain. (b) Test accuracy as a function of n in CD-n at epoch 40, showing the equilibrium and non-equilibrium samples.
Figure 5: (a) The sparse DBMs (Pegasus and Zephyr) with p-bits indexed serially: p-bits 1 to 784 are the visible p-bits, 785 to 834 are the label p-bits (50 bits for 5 sets of labels), and the rest are hidden p-bits. (b) The sparse DBMs with randomized indices. (c) Test accuracy on full MNIST as a function of training epochs for two different sparse DBMs. In both cases, training with the serial distribution of indices (no randomization) could not achieve an accuracy of more than 50%. In contrast, randomizing the indices lets the network reach 90% accuracy.
...and 14 more figures
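
The index randomization described in the Figure 5 caption amounts to assigning the visible, label, and hidden roles to graph nodes through a random permutation instead of in serial order. A minimal sketch, assuming 0-based node indices and the 4,264-node Pegasus size; the function name and seed are hypothetical, and the authors' actual procedure may differ.

```python
import numpy as np

def assign_roles(n_nodes, n_visible=784, n_label=50, randomize=True, seed=0):
    """Split node indices into visible, label, and hidden p-bits."""
    order = np.arange(n_nodes)
    if randomize:
        # Shuffle so that each role is scattered across the whole graph.
        order = np.random.default_rng(seed).permutation(n_nodes)
    visible = order[:n_visible]
    label = order[n_visible:n_visible + n_label]
    hidden = order[n_visible + n_label:]
    return visible, label, hidden

vis_serial, lab_serial, hid_serial = assign_roles(4264, randomize=False)
vis_rand, lab_rand, hid_rand = assign_roles(4264, randomize=True)
```

In the serial layout the visible p-bits occupy one contiguous block of node indices; after the permutation the same three groups still partition the graph, but each group is spread over all of it.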
