Measurement-driven neural-network training for integrated magnetic tunnel junction arrays

William A. Borders; Advait Madhavan; Matthew W. Daniels; Vasileia Georgiou; Martin Lueker-Boden; Tiffany S. Santos; Patrick M. Braganca; Mark D. Stiles; Jabez J. McClelland; Brian D. Hoskins

TL;DR

The paper tackles the challenge of deploying neural networks on hardware prone to device non-idealities by leveraging MTJ crossbar arrays integrated with CMOS to enable in-memory computing. It introduces defect-aware training methods and a robust statistics-aware training approach to mitigate device-to-device variation and shorts across 36 MTJ dies. Through hardware emulation on a two-layer binary network for MNIST, the authors demonstrate that defect-aware training can recover performance close to software baselines, and statistics-aware training yields robust cross-die performance with reduced sensitivity to defect locations. The work highlights the practical viability of MTJ-based neuromorphic accelerators and outlines open questions for scaling to deeper networks and optimizing training strategies under defect statistics.

Abstract

The increasing scale of neural networks needed to support more complex applications has led to an increasing requirement for area- and energy-efficient hardware. One route to meeting the budget for these applications is to circumvent the von Neumann bottleneck by performing computation in or near memory. An inevitability of transferring neural networks onto hardware is that non-idealities such as device-to-device variations or poor device yield impact performance. Methods such as hardware-aware training, where substrate non-idealities are incorporated during network training, are one way to recover performance at the cost of solution generality. In this work, we demonstrate inference on hardware neural networks consisting of arrays of 20,000 magnetic tunnel junctions integrated on complementary metal-oxide-semiconductor chips that closely resemble market-ready spin-transfer-torque magnetoresistive random-access memory technology. Using 36 dies, each containing a crossbar array with its own non-idealities, we show that even a small number of defects in physically mapped networks significantly degrades the performance of networks trained without defects. We also show that, at the cost of generality, hardware-aware training that accounts for the specific defects on each die can recover performance comparable to that of ideal networks. We then demonstrate a robust training method that extends hardware-aware training to statistics-aware training, producing network weights that perform well on most defective dies regardless of their specific defect locations. When evaluated on the 36 physical dies, statistics-aware trained solutions can achieve a mean misclassification error on the MNIST dataset that differs from the software baseline by only 2 %. This statistics-aware training method could be generalized to networks with many layers that are mapped to hardware suited for industry-ready applications.

Paper Structure (13 sections, 10 equations, 9 figures)

Introduction
Experimental setup
Design and test of 20,000 MTJs integrated with CMOS
Neural network architecture design and mapping onto MTJ hardware
Training the neural network
Improving classification error with hardware-aware training methods
Training and validating statistics-aware solutions
Analysis of network sensitivity
Summary and Outlook
Acknowledgements
Device-to-device variation impact on classification error
Variation of statistics-aware solution performance within each die
Yield statistics of all dies and cycle-to-cycle variation of MTJ resistance

Figures (9)

Figure 1: Fabrication and characterization of a CMOS-integrated array of 20,000 MTJs. (a) Schematic of a 2 $\times$ 2 portion of the 2T-1R crossbar array. Voltages are applied to the columns or rows while selecting a transistor column with enable lines. (b) Optical microscope image of one die containing 20,000 MTJs integrated with 40,000 transistors. 403 metallized pads for contact with probecard needles surround the array. (c) Transmission electron microscope cross-sectional image of an MTJ pillar patterned above TaN pads in contact with tungsten vias. (d) Optical microscope view of top and bottom electrodes for MTJ integration. The bottom electrode (BE) is patterned with a 3 $\times$ 3 array of MTJ vias while the top electrode (TE) is patterned with a single MTJ. Patterning the MTJ via array is an efficient process that removes extra steps during fabrication. Furthermore, the diameter of each via is two orders of magnitude larger than the MTJ pillar, producing only a trivial contribution to the measured resistance of the single MTJ. (e) Representative resistance vs. voltage curve for the MTJs used in this work. (f) Gaussian fits to the histograms for filtered $R_\text{P}$ and $R_\text{AP}$ values of MTJs in each die. (g) One example of the defect locations within a single die. Shorted devices display a constant resistance between 100 $\Omega$ and 1 k$\Omega$, while subpar MTJs switch, but with resistances ranging between 1 k$\Omega$ and 12 k$\Omega$.
Figure 2: Mapping of neural network to crossbar array hardware and inference performance on the reduced MNIST dataset. (a) Visual representation of the network architecture used for inference. Images are scaled and cropped from 28 $\times$ 28 to 10 $\times$ 10 pixel images and input to a two-layer feed-forward network. Each neuron in the output layer represents one possible handwritten digit. (b) Schematic of the neural network mapping to the MTJ crossbar array. The hardware-equivalent neural network function is the same color as the corresponding function shown in (a). Each weight is determined by the difference in conductance of two MTJs in adjacent columns, and the complete network utilizes 19,800 of the 20,000 MTJs. (c) Box and whisker plot showing inference classification error of the MNIST test dataset on 36 different crossbar array dies using 100 defect-free weight matrix solutions. White boxes represent the software-baseline classification error (defect-free solution) and colored boxes represent the same solutions emulated on the MTJ hardware. The horizontal line within each box indicates median error and the box and whiskers represent the 25th to 75th and 5th to 95th percentiles, respectively, for 100 unique weight matrix solutions. (d) Classification error of the defect-free solutions in (c) when defective MTJs are replaced with the mean ON and OFF MTJ conductance of the die. (e) Classification error for 100 hardware-aware solutions for each die, where each solution is trained around the unique defects of that die.
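The differential mapping in Figure 2(b) can be illustrated with a minimal sketch. It assumes binary weights of +1 or -1 and uses hypothetical healthy resistances `R_P` and `R_AP`, which are not taken from the paper; the helper names are also illustrative. The short value of 500 $\Omega$ is simply picked from the 100 $\Omega$ to 1 k$\Omega$ range quoted in Figure 1(g). The hidden-layer width of 90 is an inference from the device count (19,800 MTJs, two per weight, 100 inputs, 10 outputs, no separately mapped biases), not a number stated in the caption.

```python
import numpy as np

# Hypothetical resistances in ohms, chosen for illustration only.
R_P, R_AP = 15e3, 30e3   # assumed healthy parallel / antiparallel states
R_SHORT = 500.0          # inside the 100 ohm to 1 kohm range quoted for shorts
G_ON, G_OFF = 1.0 / R_P, 1.0 / R_AP

def weights_to_conductances(w):
    """Map binary weights (+1/-1) to a pair of conductances in adjacent columns."""
    g_plus = np.where(w > 0, G_ON, G_OFF)
    g_minus = np.where(w > 0, G_OFF, G_ON)
    return g_plus, g_minus

def effective_weights(g_plus, g_minus):
    """Recover weights from the conductance difference; a healthy pair gives +1 or -1."""
    return (g_plus - g_minus) / (G_ON - G_OFF)

# Device-count check under the assumed 100-90-10 layout, two MTJs per weight.
n_in, n_hidden, n_out = 100, 90, 10
n_mtj = 2 * (n_in * n_hidden + n_hidden * n_out)

# Small example: short one device in the positive column and inspect the weight.
w = np.array([[1.0, -1.0], [-1.0, 1.0]])
g_plus, g_minus = weights_to_conductances(w)
g_plus[0, 0] = 1.0 / R_SHORT
w_eff = effective_weights(g_plus, g_minus)
```

With these assumed values, the shorted pair yields an effective weight far larger in magnitude than the healthy pairs, which is one way to see why even a few shorts can degrade a network trained without defects.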
Figure 3: Training of statistics-aware solutions. (a) Visual representation of training a statistics-aware solution. For each batch of input images, a batch of defect maps of the same size is randomly generated. In this work, both the input image batch size and the defect map batch size are 100. (b)(top) Classification error of all 36 dies for $W_{\text{sat}}$ ranging from 0 to 20. Each circle represents the mean error of 100 solutions for one die assuming a single $W_{\text{sat}}$. Yellow squares represent the mean across all dies, while the dotted line represents the mean software-baseline classification error. Bold dots signify the best (blue) and worst (red) performing dies. (bottom) $\Delta$, defined as the difference in error between the software baseline and the emulated hardware (yellow circles), and 1/$\alpha$, defined as the inverse of the coefficient of variation for hardware emulation (red circles). Error bars represent the one-standard-deviation width of the distribution of 36 dies. Both values are determined by ignoring the outliers shown in the above plot.
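The per-batch defect sampling in Figure 3(a) might look roughly like the sketch below. Several pieces are assumptions rather than details from the caption: a defective weight is modeled as being forced to $\pm W_{\text{sat}}$ with a random sign, defects are drawn independently with a hypothetical probability `p_defect`, each image in the batch is paired with its own defect map, and the 100-by-90 first-layer shape is inferred from the device count. Only the forward pass through the first layer is shown; the function names are illustrative.

```python
import numpy as np

def sample_defect_maps(rng, batch, shape, p_defect, w_sat):
    """Draw one random defect map per sample (assumed model: defects pin weights to +/- w_sat)."""
    mask = rng.random((batch, *shape)) < p_defect            # True where a weight is defective
    signs = rng.choice([-1.0, 1.0], size=(batch, *shape))    # random polarity of each defect
    return mask, signs * w_sat

def apply_defects(w, mask, values):
    """Broadcast the shared weights over the batch and override the defective entries."""
    return np.where(mask, values, w[None, ...])

rng = np.random.default_rng(0)
batch = 100                                    # image batch size = defect-map batch size
w1 = rng.choice([-1.0, 1.0], size=(100, 90))   # binary first-layer weights (assumed shape)
x = rng.random((batch, 100))                   # stand-in for 10 x 10 input images

mask, values = sample_defect_maps(rng, batch, w1.shape, p_defect=0.01, w_sat=20.0)
w1_batch = apply_defects(w1, mask, values)     # shape (batch, 100, 90)

# Each image sees its own defective copy of the weights.
hidden = np.einsum("bi,bij->bj", x, w1_batch)
```

Averaging a training loss over such randomly perturbed copies is one plausible way to favor weights that tolerate defects regardless of where they fall, which is the behavior the caption attributes to statistics-aware solutions.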
Figure 4: Histograms of $\mathcal{I}$ and $\mathcal{I}^\text{HW}$ among the layer 1 weights across 100 pure software solutions and 100 statistics-aware solutions evaluated on the training dataset. Light blue and orange curves represent the defect-free (DF) and statistics-aware (SA) solutions, respectively, determined with $L$ (defect-free loss). Navy blue and red curves represent the defect-free and statistics-aware solutions, respectively, determined from $L^\text{HW}$ (statistics-aware loss). Each curve represents the mean of 100 solutions for each bin in the histogram and the uncertainty in the mean at each value. The uncertainty in the mean for each curve is negligible, causing the mean curve and the error bands to overlap. All histograms are binned in an identical manner. (inset) Resulting classification error for the four histograms shown, plotted on a logarithmic scale. Boxes and whiskers represent the 25th to 75th and 5th to 95th percentiles, respectively. Note that the high classification error for the solutions evaluated on the statistics-aware loss is the effect of choosing a value of 20 for $W_{\text{sat}}$. Actual dies contain a distribution of defect values and thus show improved performance.
Figure 5: Relative loss landscapes for defect-free and statistics-aware training cases, evaluated on the training dataset. (a,b) The relative loss landscape calculated for a defect-free trained solution (a) and a statistics-aware trained solution (b) evaluated on the defect-free loss, $L$. The relative loss is defined as the difference between the loss at each point and the absolute minimum loss obtained in each case. (c,d) The relative loss landscape for the defect-free and statistics-aware solutions, evaluated in the presence of defects, $L^\text{HW}$. (c) and (d) only consider the effects of one unique defect map. (e,f) The relative loss landscape for the defect-free and statistics-aware solutions, evaluated in the presence of defects, plotted as an average across 100 randomly generated defect maps. White dots represent the location of the absolute minimum and labels represent the value at the minimum. The same two random directions $\delta$ and $\eta$ defined in Eq. \ref{eq:Loss land eq} are used for each of the six plots. Within each training method, we also evaluate the loss landscape on an identical weight solution.
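A relative loss landscape of the kind described in Figure 5 can be sketched as follows. The equation that defines $\delta$ and $\eta$ is not reproduced here, so the sketch assumes the common construction in which the loss is evaluated at $\theta + a\,\delta + b\,\eta$ on a grid of scalars $(a, b)$; the quadratic `toy_loss` and the function names are placeholders, not the paper's loss.

```python
import numpy as np

def relative_loss_landscape(loss_fn, theta, delta, eta, coords):
    """Evaluate loss_fn on the plane theta + a*delta + b*eta and subtract the grid minimum."""
    grid = np.array([[loss_fn(theta + a * delta + b * eta) for b in coords]
                     for a in coords])
    return grid - grid.min()

def toy_loss(theta):
    # Placeholder quadratic bowl with its minimum at the origin.
    return float(np.sum(theta ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(50)             # stand-in for a trained weight vector
delta = rng.standard_normal(50)  # first random direction
eta = rng.standard_normal(50)    # second random direction
coords = np.linspace(-1.0, 1.0, 21)

landscape = relative_loss_landscape(toy_loss, theta, delta, eta, coords)
i_min, j_min = np.unravel_index(np.argmin(landscape), landscape.shape)
```

Reusing the same `delta` and `eta` for every landscape, as the caption describes, keeps the panels directly comparable; a defect-averaged version would average `loss_fn` over many sampled defect maps before subtracting the minimum.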
...and 4 more figures
