STAR: An Efficient Softmax Engine for Attention Model with RRAM Crossbar

Yifeng Zhai, Bing Li, Bonan Yan, Jing Wang

TL;DR

STAR addresses the softmax bottleneck in attention models on RRAM crossbars with a dedicated softmax engine built from CAM/SUB and CAM/LUT crossbars, which compute $x_i-x_{max}$ and $e^{x_i-x_{max}}$. A vector-grained pipeline lets attention execute in parallel across vector units, balancing precision against hardware efficiency. Bitwidth analysis indicates that 8-bit, 9-bit, and 7-bit quantization schemes maintain accuracy on CNEWS, MRPC, and CoLA. Experimental results show up to $30.63\times$ higher computing efficiency than the GPU and $1.31\times$ higher than ReTransformer, with favorable area and power metrics, demonstrating practical in-memory attention acceleration through crossbar-tuned softmax primitives.
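To make the two primitives concrete, the following is a minimal software sketch of a max-subtracted softmax whose exponential is read from a lookup table. It is not STAR's circuit: the Python list `exp_lut` is only a stand-in for a table lookup, and the function name `lut_softmax` and the parameters `frac_bits` and `total_bits` are hypothetical choices for illustration, not values or names from the paper.

```python
import math

def lut_softmax(x, frac_bits=4, total_bits=8):
    """Softmax via max subtraction and a lookup-table exponential (illustrative only)."""
    scale = 1 << frac_bits            # fixed-point steps per unit (assumed format)
    max_code = (1 << total_bits) - 1  # largest difference code the table can hold

    # Precomputed table of e^(-k/scale) for every code k.
    # Hypothetical software stand-in for a lookup structure.
    exp_lut = [math.exp(-k / scale) for k in range(max_code + 1)]

    x_max = max(x)
    # x_max - x_i is non-negative; quantize it and saturate at the table's last entry.
    codes = [min(max_code, round((x_max - xi) * scale)) for xi in x]

    # e^(x_i - x_max) becomes a table read instead of an exponential evaluation.
    exps = [exp_lut[c] for c in codes]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling `lut_softmax([0.3, 1.7, -2.2, 1.1])` returns four probabilities that sum to 1 and stay close to the exact softmax. Because every difference $x_{max}-x_i$ is non-negative, the table needs only unsigned indices, and its size is fixed by the chosen bitwidth, which is where a precision-versus-cost trade-off of the kind described above would appear in this sketch.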

Abstract

RRAM crossbars have been studied as a basis for in-memory accelerators for neural network applications because of their in-situ computing capability. However, prior RRAM-based accelerators show efficiency degradation when executing the popular attention models. We observe that the frequent softmax operations are the efficiency bottleneck and are also insensitive to computing precision. We therefore propose STAR, which boosts computing efficiency with an efficient RRAM-based softmax engine and a fine-grained global pipeline for attention models. Specifically, STAR exploits the versatility and flexibility of RRAM crossbars to trade off model accuracy against hardware efficiency. Experimental results on several datasets show that STAR achieves up to 30.63x and 1.31x computing efficiency improvements over the GPU and the state-of-the-art RRAM-based attention accelerators, respectively.

Paper Structure (5 sections, 3 figures, 1 table)

This paper contains 5 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: The $x_i-x_{max}$ operation design.
  • Figure 2: The exponential operation design in our softmax engine.
  • Figure 3: Computing efficiency comparison results.