Kernel Approximation using Analog In-Memory Computing

Julian Büchel, Giacomo Camposampiero, Athanasios Vasilopoulos, Corey Lammie, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian

TL;DR

An approach to kernel approximation in machine learning algorithms suitable for mixed-signal Analog In-Memory Computing (AIMC) architectures that addresses the performance bottlenecks of conventional kernel-based methods by executing most operations in approximate kernel methods directly in memory.

Abstract

Kernel functions are vital ingredients of several machine learning algorithms, but often incur significant memory and computational costs. We introduce an approach to kernel approximation in machine learning algorithms suitable for mixed-signal Analog In-Memory Computing (AIMC) architectures. Analog In-Memory Kernel Approximation addresses the performance bottlenecks of conventional kernel-based methods by executing most operations in approximate kernel methods directly in memory. The IBM HERMES Project Chip, a state-of-the-art phase-change-memory-based AIMC chip, is used to demonstrate kernel approximation in hardware. Experimental results show that our method maintains high accuracy, with less than a 1% drop in kernel-based ridge classification benchmarks and within 1% accuracy on the Long Range Arena benchmark for kernelized attention in Transformer neural networks. Compared to traditional digital accelerators, our approach is estimated to deliver superior energy efficiency and lower power consumption. These findings highlight the potential of heterogeneous AIMC architectures to enhance the efficiency and scalability of machine learning applications.

Paper Structure

This paper contains 12 sections, 7 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Proposed in-memory kernel approximation technique. a The two vectors $x$ and $y$ for which we want to approximate the kernel function $k$ are projected onto $m$ weight vectors $\omega_i$ drawn from the kernel-dependent probability distribution $p(\omega)$. After element-wise post-processing, one obtains $z(x)$ and $z(y)$, whose dot product approximates the kernel evaluated on $x$ and $y$. b We program the sampled vectors $\omega_i$ into the columns of a memristive crossbar array and perform the projection in memory. The element-wise post-processing and the inner product are computed in digital hardware. c Each weight element of the vectors $\omega_i$ is represented with four devices, arranged in a crossbar. To perform a matrix-vector multiplication, inputs are quantized to 8 bits and converted to voltage pulses that are applied to the rows of the crossbar. A current proportional to the weight in each unit cell accumulates along the column, representing the dot product of the input with that column. The accumulated current is converted back to the digital domain using analog-to-digital converters. The crossbar array and the peripheral circuits are integrated into one tile.
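The pipeline in panels a to c can be sketched in NumPy for the RBF kernel using random Fourier features, one standard instance of this technique; whether it matches the paper's exact variant is an assumption. The helper names (`sample_omegas`, `program_crossbar`, `quantize_input`, `in_memory_projection`, `post_process`) are hypothetical, and the crossbar is reduced to a toy model: 8-bit symmetric input quantization plus Gaussian programming noise drawn once. The four-device weight encoding, output conversion and drift are not modeled.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    # Exact RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def sample_omegas(d, m, gamma, rng):
    # For the RBF kernel, p(omega) = N(0, 2*gamma*I); each column of W is one omega_i
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, m))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)  # random phases used in post-processing
    return W, b

def program_crossbar(W, rng, rel_noise=0.02):
    # Hypothetical programming-noise model: Gaussian error relative to max |w|, drawn once
    return W + rel_noise * np.max(np.abs(W)) * rng.standard_normal(W.shape)

def quantize_input(x, bits=8):
    # Symmetric uniform quantization of the input vector to `bits` bits
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale if scale > 0 else x

def in_memory_projection(x, W_prog, bits=8):
    # Simulated crossbar MVM: quantized inputs times the programmed (noisy) weights
    return quantize_input(x, bits) @ W_prog

def post_process(proj, b):
    # Digital element-wise post-processing: z(x) = sqrt(2/m) * cos(W^T x + b)
    m = proj.shape[-1]
    return np.sqrt(2.0 / m) * np.cos(proj + b)

rng = np.random.default_rng(0)
d, m, gamma = 16, 4096, 0.1
x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)

W, b = sample_omegas(d, m, gamma, rng)
W_prog = program_crossbar(W, rng)

exact = rbf_kernel(x, y, gamma)
approx_fp = float(post_process(x @ W, b) @ post_process(y @ W, b))
approx_hw = float(post_process(in_memory_projection(x, W_prog), b)
                  @ post_process(in_memory_projection(y, W_prog), b))
print(f"exact={exact:.4f}  fp32={approx_fp:.4f}  simulated-hw={approx_hw:.4f}")
```

Only the projection touches the (simulated) crossbar; the cosine and the final dot product stay digital, mirroring the split described in panel b.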
  • Figure 1: Hermes hardware experiments on the cod-rna dataset.
  • Figure 2: Experimental results for the hardware-accelerated approximations of the RBF and arc-cosine kernels. a Downstream classification accuracy. Performance is largely retained when the kernel approximations are deployed on the IBM HERMES Project chip, with an average empirical accuracy loss of $0.481\%$ for the RBF kernel and $0.939\%$ for the zeroth-order arc-cosine kernel. Only one kernel (arc-cosine) on a single benchmark (EEG) shows an accuracy delta greater than $1\%$; all other results show a smaller gap, and almost half of them lose less than $0.5\%$ accuracy in hardware compared to the equivalent floating-point baseline. For each combination of kernel and dataset we report the delta between the FP-32 and hardware implementations, $\Delta=\text{acc}(\text{fp})-\text{acc}(\text{hw})$. Each bar is averaged over $10$ random seeds and three approximation techniques, obtained for a fixed $\log(D/d)=5$. The error bars indicate the standard deviation across random seeds, averaged across approximation techniques. b Comparison between the normalized approximation error of the FP-32 and hardware implementations. The plotted error is obtained by normalizing the approximation error for each task by the maximum error obtained across approximations and benchmarks on that same task, and then averaging across tasks. As expected, the normalized approximation error increases for the models deployed on hardware; the increase is most noticeable for higher log-ratios with the zeroth-order arc-cosine kernel and is present to a lesser extent for the RBF kernel. All standard deviations in a and b are reported over $10$ random seeds.
The results on downstream classification accuracy and approximation error are reported in detail in the extended results of the supplementary material.
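A minimal sketch of the zeroth-order arc-cosine setting behind panel a, assuming the standard step-function random features for this kernel ($k_0(x,y) = 1 - \theta/\pi$ with $\omega \sim \mathcal{N}(0, I)$), a closed-form ridge readout on hypothetical synthetic two-class data, and a hypothetical Gaussian programming-noise model standing in for the chip. As a further assumption, the readout is trained on FP-32 features and reused on the perturbed test features. The toy numbers say nothing about the paper's benchmarks; the point is how $\Delta=\text{acc}(\text{fp})-\text{acc}(\text{hw})$ is formed.

```python
import numpy as np

def arccos0_exact(X, Y):
    # Zeroth-order arc-cosine kernel: k0(x, y) = 1 - theta(x, y) / pi
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - np.arccos(np.clip(Xn @ Yn.T, -1.0, 1.0)) / np.pi

def arccos0_features(proj):
    # Post-processing z(x) = sqrt(2/m) * step(W^T x); unbiased for k0 when omega ~ N(0, I)
    m = proj.shape[1]
    return np.sqrt(2.0 / m) * (proj > 0).astype(float)

def ridge_fit(Z, targets, lam):
    # Closed-form ridge regression on the features: (Z^T Z + lam I)^{-1} Z^T T
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ targets)

def accuracy(Z, readout, labels):
    return float(np.mean(np.argmax(Z @ readout, axis=1) == labels))

def make_blobs(n, d, rng, shift=0.7):
    # Hypothetical two-class toy data: Gaussian blobs centred at +shift and -shift
    labels = rng.integers(0, 2, size=n)
    centres = np.where(labels[:, None] == 0, shift, -shift)
    return rng.standard_normal((n, d)) + centres, labels

rng = np.random.default_rng(1)
d, m, lam = 8, 1024, 0.1
X_train, y_train = make_blobs(400, d, rng)
X_test, y_test = make_blobs(400, d, rng)

W = rng.standard_normal((d, m))                    # omega_i ~ N(0, I)
W_prog = W + 0.05 * rng.standard_normal(W.shape)   # hypothetical programming noise

readout = ridge_fit(arccos0_features(X_train @ W), np.eye(2)[y_train], lam)

acc_fp = accuracy(arccos0_features(X_test @ W), readout, y_test)
acc_hw = accuracy(arccos0_features(X_test @ W_prog), readout, y_test)
delta = acc_fp - acc_hw                            # Delta = acc(fp) - acc(hw)

# Mean absolute kernel approximation error on a subset of test points (FP-32 features)
Z_sub = arccos0_features(X_test[:50] @ W)
kernel_err = float(np.mean(np.abs(Z_sub @ Z_sub.T - arccos0_exact(X_test[:50], X_test[:50]))))
print(f"acc fp={acc_fp:.3f}  hw={acc_hw:.3f}  delta={delta:+.3f}  mean|k-k_hat|={kernel_err:.3f}")
```

Because the step features only depend on the sign of each projection, noise in the programmed weights matters mainly for inputs whose projections lie close to zero, which is one intuition for why this kernel can be more sensitive to hardware error.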
  • Figure 2: Hermes hardware experiments on the eeg dataset.
  • Figure 3: Schematic overview of in-memory kernel approximation for Transformers. a After projecting the input to the query, key and value matrices for each head, they are fed into the "Approximated Scaled Dot-Product Attention" units, where $h$ is the number of heads. Inside each unit, the query and key vectors are projected into the $m$-dimensional space in memory, followed by the Softmax-kernel-specific post-processing $z$. Finally, the attention output can be calculated without explicitly computing the attention matrix, because the matrix-matrix multiplications can be re-ordered: $(z(Q)\,z(K)^\top)\,V$ can be evaluated as $z(Q)\,(z(K)^\top V)$. This re-ordering is possible because the Softmax function no longer needs to be applied to the attention scores. b Experimental results on the IBM HERMES Project chip show that approximating the Softmax kernel yields slightly higher approximation errors than FP-32. Moreover, the approximation error (measured as the distance to the exact attention matrix) decreases as the hidden dimension $m$ increases.
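The re-ordering in panel a can be illustrated with positive random features for the Softmax kernel in the style of Performer (FAVOR+). This is an assumed choice of post-processing $z$, not necessarily the one run on the chip, and the function names are hypothetical. Only the projection `X @ W` would map to the crossbar; the exponential and the re-ordered products are digital. The sketch compares the approximate output with exact attention for two illustrative values of $m$.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Exact scaled dot-product attention; materialises the L x L attention matrix
    S = Q @ K.T / np.sqrt(Q.shape[1])
    A = np.exp(S - S.max(axis=1, keepdims=True))  # subtract row max for stability
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def positive_features(X, W):
    # phi(x) = exp(W^T x - ||x||^2 / 2) / sqrt(m), so E[phi(q) . phi(k)] = exp(q . k)
    # X @ W is the projection that would run in memory; the exp is digital post-processing
    return np.exp(X @ W - 0.5 * np.sum(X ** 2, axis=1, keepdims=True)) / np.sqrt(W.shape[1])

def approx_attention(Q, K, V, W):
    # Scale by d^{-1/4} so that phi(q) . phi(k) approximates exp(q . k / sqrt(d))
    scale = Q.shape[1] ** -0.25
    Qp, Kp = positive_features(Q * scale, W), positive_features(K * scale, W)
    KV = Kp.T @ V                      # (m, d_v): re-ordered product, no L x L matrix
    normaliser = Qp @ Kp.sum(axis=0)   # (L,): replaces the Softmax row sums
    return (Qp @ KV) / normaliser[:, None]

rng = np.random.default_rng(2)
L, d = 64, 16
Q, K = 0.5 * rng.standard_normal((L, d)), 0.5 * rng.standard_normal((L, d))
V = rng.standard_normal((L, d))
exact = softmax_attention(Q, K, V)

rel_err = {}
for m in (64, 4096):
    W = rng.standard_normal((d, m))    # omega_i ~ N(0, I), shared by queries and keys
    out = approx_attention(Q, K, V, W)
    rel_err[m] = float(np.linalg.norm(out - exact) / np.linalg.norm(exact))
print(rel_err)
```

Since `KV` has shape (m, d_v), the cost of the approximate attention grows linearly in the sequence length L rather than quadratically, and the relative error shrinks as m grows, in line with the trend described in panel b.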