SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors

Mariam Rakka, Jinhao Li, Guohao Dai, Ahmed Eltawil, Mohammed E. Fouda, Fadi Kurdahi

TL;DR

This work proposes SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware, making LLMs more deployable without compromising performance.

Abstract

Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.

Paper Structure

This paper contains 11 sections, 8 figures, 6 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Softmax share of total runtime for Llama2-7b on A100 (80GB). Softmax contributes up to 38% of the runtime at longer sequence lengths.
  • Figure 2: Overview of transformer block in Llama2 model.
  • Figure 3: SRAM-based AP performing XOR operation between vectors A and B, containing words of precision 2.
  • Figure 4: Approximate Softmax mapping on one AP inside one head.
  • Figure 5: AP data flow of the approximate Softmax.
  • ...and 3 more figures