PASS: An Asynchronous Probabilistic Processor for Next Generation Intelligence

Saavan Patel, Philip Canoza, Adhiraj Datar, Steven Lu, Chirag Garg, Sayeef Salahuddin

TL;DR

This work demonstrates the Parallel Asynchronous Stochastic Sampler (PASS), the first fully on-chip integrated, asynchronous, probabilistic accelerator that takes advantage of the intrinsic fine-grained parallelism of the Ising model and is built in state-of-the-art 14nm CMOS FinFET technology.

Abstract

New computing paradigms are required to solve the most challenging computational problems, for which no exact polynomial-time solution exists. Probabilistic Ising accelerators have shown promise on these problems, with the ability to model complex probability distributions and find ground states of intractable problems. In this context, we have demonstrated the Parallel Asynchronous Stochastic Sampler (PASS), the first fully on-chip integrated, asynchronous, probabilistic accelerator that takes advantage of the intrinsic fine-grained parallelism of the Ising model and is built in state-of-the-art 14nm CMOS FinFET technology. We have demonstrated broad applicability of this accelerator on problems ranging from combinatorial optimization and neural simulation to machine learning, along with up to $23,000$x energy-to-solution improvement compared to CPUs on probabilistic problems.
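
As a rough cross-check of the energy-to-solution figure, the speed and power numbers quoted in the Figure 4 caption can be combined. This is a back-of-the-envelope reading that assumes energy is simply average power times run time; it is not presented as the authors' own derivation, and the symbols $E$, $P$, and $t$ are introduced here for illustration only.

$$
\frac{E_{\text{CPU}}}{E_{\text{PASS}}}
= \frac{P_{\text{CPU}}}{P_{\text{PASS}}} \cdot \frac{t_{\text{CPU}}}{t_{\text{PASS}}}
\approx 130 \times 180 = 23{,}400
$$

$$
P_{\text{PASS}} \approx 256 \times 222\,\mu\text{W} \approx 56.8\,\text{mW},
\qquad
\frac{7\,\text{W}}{56.8\,\text{mW}} \approx 123
$$

The first line reproduces the $23,400$x value in the Figure 4 caption, which the abstract rounds to $23,000$x. The second line checks that the per-neuron power is consistent with the full-chip power for a 16x16 core, and that the raw single-core power ratio is close to the quoted $\approx 130$x.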

Paper Structure (28 sections, 14 equations, 15 figures, 4 tables)

This paper contains 28 sections, 14 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Representation of the PASS algorithm and system. (A) The PASS system relies on a system of interacting spins, and can be mapped to applications in many areas, including Optimization, Machine Learning and Neural Simulation. (B) We follow a Boltzmann Machine approach, where we probabilistically sample over a system of pairwise interacting spins. (C) Sampling proceeds by finding low energy configurations, which map to high probability states. (D) The traditional synchronous update scheme for a Boltzmann system has each update occur on a fixed clock, as shown here. This only allows one variable to update at a time, slowing down the overall update rate. (E) Under an asynchronous update scheme, each neuron is not forced onto a fixed update schedule and can update continuously based on its neighboring values. This yields an embarrassingly parallel approach that drastically speeds up convergence of the underlying sampling algorithm.
  • Figure 2: Description of Hardware Design for the PASS system. (A) A circuit diagram of the neuron and connection circuitry. The neuron circuitry is formed of 3 parts: a noise and amplification stage using amplified shot noise from a diode, a sigmoidal comparator to produce the activation, and a digitization stage to binarize the output. The connection circuitry is a digital binary dot product engine and a digital-to-analog converter (DAC). (B) The individual neuron circuit produces an activation function that directly matches the expected sigmoidal activation function. The activation behaves well over a wide range of input voltages. This activation was directly extracted from the silicon outputs. Error bars on the sigmoid data show 95% confidence interval estimates for the activation function over 100 runs. (C), (D), (E) Raw voltage waveforms for 3 voltages (0.1V, 0.45V and 0.55V) from the sigmoidal activation. These waveforms show how the input voltage modulates the average output value, without the interaction of any input clock signal. (F), (G), (H) Post-layout images of the chip at various levels of integration. The individual neurons are composed of the binary dot product, DAC and analog subsystems, which are then integrated into small clusters, and finally into a 16x16 neuron core. The neuron core consumes 1mm x 1mm of die area, and the peripheral circuitry, I/O and fill consume the rest of the 2mm x 2mm chip area. (I) The neurons communicate with their neighbors on a king's move graph, where each neuron is connected to its nearest neighbors as well as its diagonal neighbors.
  • Figure 3: Optimization Tasks. (A) Probability distribution of the MaxCUT problem, sampled from the time series on the right. The distribution shows maxima at the two solutions of the MaxCUT problem shown in the figure inset. (B), (C), (D), (E) Time series showing fluctuations between correct states over time. Each node spontaneously switches states and influences the state of neighboring nodes. (F) A MaxCUT problem encompassing the full stochastic core, with the ability to model arbitrary problems. The ground state is encoded to spell out the letters C, A, and L, which the system finds with probability approaching 1. (G) Scaling simulations of the asynchronous PASS system vs. a synchronous system on the fully connected MaxCUT problem running at the same effective clock frequency. The asynchronous PASS system shows 200x improvement at 150 nodes, with a clear scaling improvement due to the asynchronous updates. (H) Performance comparison of the simulated PASS system with state-of-the-art systems on the 100 node MaxCUT problem (data taken from Patel2020LogicallyFactorization, Goto2021High-performanceMechanics). The PASS simulations show the ability to solve problems 2x faster than the fastest alternate solver.
  • Figure 4: Multiplier Free Generative Machine Learning. (A) Diagram of the overall machine learning system with the PASS system. The host system holds the training data, which it uses to calculate the data expectation, and the PASS system computes the model expectation with the given weights and biases. The host system then calculates changes in weights based on the two expectations and iterates until the model has converged. None of the operations (expectations, binary outer products, averaging) require multiplications, due to the binary nature of activations and the PASS stochastic activation system. (B) An example of learned digit distributions taken from the MNIST dataset. The PASS system is trained on each digit individually; these images show the average activations after being trained on the given digit. (C) After learning the digit distributions, the PASS system can perform generative modeling tasks, such as image reconstruction given partial images. The system is clamped with the top half of a digit (top figures), and the bottom represents a sampled output from the system, showing that it can effectively model the given distribution. (D) PASS is able to produce samples $180$x faster, with a flat scaling resulting from the ability to fully utilize the parallelism of the PASS platform. The CPU is running. This yields an extremely power and time efficient platform for machine learning. (E) PASS is able to produce samples using $\approx 130$x less power during the sampling run for full chip simulation (222 μW per neuron and 56.8 mW full chip vs. 7 W for CPU power consumption on a single core). This yields an overall $23,400$x improvement in energy to solution to produce a given number of samples for a machine learning experiment.
  • Figure 5: Neural Decision Making in primitive brains using the PASS system. (A) Diagram showing how the Ising model is mapped onto decision making in primitive fly brains. As the fly moves closer to the targets, the neurons spontaneously make a collective decision about which target to approach based on stochastic ring attractor dynamics. (B), (C), (D), (E) The neural tuning parameter $\eta$ affects the geometry of the space that the fly operates in. As $\eta$ increases, the fly makes decisions closer to the targets. Targets are placed at {0, 1000} and {1000, 1000}. (F) With the tuning parameter set to $\eta=1.0$, the sampled trajectories from the PASS chip (the colored dotted lines) match closely with actual trajectories from flies placed into a virtual reality environment. The heatmap shows the density of actual fly trajectories in a virtual reality environment with two targets. (G) The PASS chip sampling trajectories for the 3 target case. Sampled trajectories show random decisions associated with fly trajectories, maintaining discrete decision points associated with the targets.
  • ...and 10 more figures
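
The sampling idea described in the Figure 1 and Figure 3 captions (pairwise interacting spins, a sigmoidal stochastic activation, and MaxCUT encoded as a low energy state) can be sketched in software. The following is a minimal illustration under stated assumptions, not the chip's implementation: it uses the textbook energy $E(s) = -\tfrac{1}{2} s^T W s - b^T s$ with spins in $\{-1, +1\}$, emulates clock-free updates by picking one random neuron per step, and uses a 4 node ring as a toy MaxCUT instance. All function names and the inverse temperature parameter `beta` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used as the stochastic activation."""
    return 1.0 / (1.0 + np.exp(-x))


def ring_adjacency(n):
    """Adjacency matrix of an n-node ring (toy MaxCUT instance, assumed here)."""
    A = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        A[i, j] = A[j, i] = 1.0
    return A


def maxcut_to_ising(A):
    """Antiferromagnetic couplings W = -A, so low energy means a large cut."""
    return -A, np.zeros(A.shape[0])


def ising_energy(W, b, s):
    """E(s) = -1/2 s^T W s - b^T s for spins s in {-1, +1}."""
    return -0.5 * s @ W @ s - b @ s


def cut_value(A, s):
    """Total weight of edges whose endpoints carry opposite spins."""
    return 0.25 * np.sum(A * (1.0 - np.outer(s, s)))


def sample_random_order(W, b, n_updates, beta=1.0, seed=0):
    """Gibbs sampling with one randomly chosen neuron updated per step.

    Random-order updates are a software stand-in for neurons that fire
    without a shared clock; real hardware updates many neurons at once.
    """
    rng = np.random.default_rng(seed)
    n = len(b)
    s = rng.choice([-1.0, 1.0], size=n)
    samples = np.empty((n_updates, n))
    for t in range(n_updates):
        i = rng.integers(n)
        field = W[i] @ s + b[i]               # weighted input from neighbors
        p_up = sigmoid(2.0 * beta * field)    # P(s_i = +1 | other spins)
        s[i] = 1.0 if rng.random() < p_up else -1.0
        samples[t] = s
    return samples


if __name__ == "__main__":
    A = ring_adjacency(4)
    W, b = maxcut_to_ising(A)
    samples = sample_random_order(W, b, n_updates=4000, beta=2.0, seed=0)
    cuts = [cut_value(A, s) for s in samples[2000:]]
    print("mean cut after burn-in:", np.mean(cuts))
```

On this toy ring the two alternating spin patterns are the ground states, mirroring the two-peaked distribution described in Figure 3(A). Larger `beta` concentrates the samples on those states, at the cost of slower switching between them.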