
MARCA: Mamba Accelerator with ReConfigurable Architecture

Jinhao Li, Shan Huang, Jiaming Xu, Jun Liu, Li Ding, Ningyi Xu, Guohao Dai

TL;DR

MARCA is a Mamba accelerator with a reconfigurable architecture. It pairs an intra-operation buffer management strategy, which maximizes input data sharing for linear operations within operations, with an inter-operation strategy for element-wise operations between operations.

Abstract

We propose MARCA, a Mamba accelerator with a reconfigurable architecture, built on three novel approaches. (1) A reduction alternative PE array architecture for both linear and element-wise operations. For linear operations, the reduction tree connected to the PE arrays is enabled and executes the reduction operation. For element-wise operations, the reduction tree is disabled and the output bypasses it. (2) A reusable nonlinear function unit based on the reconfigurable PE. We decompose the exponential function into element-wise operations and a shift operation using a fast biased exponential algorithm, and the activation function (SiLU) into a range detection and element-wise operations using a piecewise approximation algorithm. Thus, the reconfigurable PEs are reused to execute nonlinear functions with negligible accuracy loss. (3) An intra-operation and inter-operation buffer management strategy. We propose an intra-operation buffer management strategy to maximize input data sharing for linear operations within operations, and an inter-operation strategy for element-wise operations between operations. We conduct extensive experiments on Mamba model families of different sizes. MARCA achieves up to 463.22$\times$/11.66$\times$ speedup and up to 9761.42$\times$/242.52$\times$ higher energy efficiency compared to Intel Xeon 8358P CPU and NVIDIA Tesla A100 GPU implementations, respectively.
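To make approach (2) more concrete, the sketch below shows one well-known way to reduce an exponential to a multiply-add followed by a shift, in the style of Schraudolph's IEEE-754 trick. It is an illustration under stated assumptions, not necessarily the paper's fast biased exponential algorithm: the function name `fast_exp`, the float32 layout, and the absence of any error-correction constant are all choices made here for clarity.

```python
import math
import struct

MANTISSA_BITS = 23   # float32 mantissa width
EXPONENT_BIAS = 127  # float32 exponent bias


def fast_exp(x: float) -> float:
    """Approximate exp(x) by writing x*log2(e) + bias into the float32 exponent field.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Element-wise multiply-add: convert to a base-2 exponent and add the bias.
    y = x * (1.0 / math.log(2.0)) + EXPONENT_BIAS
    # "Shift": scaling by 2**23 moves y into the exponent field.
    # In fixed-point hardware this scaling is a plain left shift.
    bits = int(y * (1 << MANTISSA_BITS))
    # The trick only holds while the result is a positive, normal float32.
    if not (1 << MANTISSA_BITS) <= bits < (255 << MANTISSA_BITS):
        raise ValueError("x is outside the range this approximation supports")
    # Reinterpret the integer bit pattern as a float32 value.
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for x in (-2.0, 0.0, 1.0, 5.0):
        print(x, fast_exp(x), math.exp(x))
```

Without a correction constant, this variant interpolates linearly between powers of two, so it overestimates the true value by up to roughly 6%. Published variants subtract a small tuned constant to balance the error, and the accuracy reported for MARCA refers to the authors' own algorithm, not to this sketch.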

Paper Structure (24 sections, 3 equations, 10 figures, 4 tables)


Figures (10)

  • Figure 1: Runtime breakdown of Mamba across sequence lengths. Element-wise operations account for a large fraction of the runtime at long sequence lengths, while linear operations dominate at short sequence lengths.
  • Figure 2: Challenges in Mamba computation: (1) incompatibility between element-wise operations and Tensor Core, (2) 30% area overhead for nonlinear function unit, and (3) large memory access but limited input data sharing for element-wise operations. We propose three novel contributions in MARCA: (1) reduction alternative PE array architecture, (2) reusable nonlinear function unit, and (3) intra-operation and inter-operation buffer management strategy, to solve these challenges.
  • Figure 3: Computational flow in the Mamba block and SSM. The Mamba model consists of N blocks with residual connections. In the SSM, $\Delta$, $B$ and $C$ are generated from the input $x$. The SSM then loops to update the hidden state $h$ and generates the output $y$.
  • Figure 4: Left: Architecture of the MARCA accelerator. MARCA mainly consists of an instruction processing unit, a normalization unit, an on-chip buffer, and a compute engine. Middle: Architecture of the reconfigurable processing element and reduction tree in the RCU. Right: Four reconfigurable modes of the RCU: MM-RCU, EW-RCU, EXP-RCU, and SiLU-RCU.
  • Figure 5: Instruction set architecture with 16 32-bit general-purpose registers and 16 32-bit constant registers. All instructions are 64-bit.
  • ...and 5 more figures
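
The SSM loop described in the Figure 3 caption can be sketched for a single channel as follows. This is a minimal illustration that assumes the standard Mamba discretization, $h_t = \exp(\Delta_t A)\,h_{t-1} + \Delta_t B_t x_t$ and $y_t = C_t h_t$, which this page does not spell out. The function name `ssm_scan`, the diagonal `A`, and the choice to pass `delta`, `B` and `C` in precomputed (rather than projecting them from `x`) are assumptions made here.

```python
import math


def ssm_scan(x, delta, A, B, C):
    """Run a single-channel selective-scan recurrence (hypothetical helper).

    x:     list of T input scalars
    delta: list of T step sizes
    A:     list of N diagonal state-matrix entries
    B, C:  lists of T vectors, each of length N
    """
    n = len(A)
    h = [0.0] * n  # hidden state, initialized to zero
    y = []
    for t in range(len(x)):
        # Element-wise work: an exponential, multiplies and adds per state entry.
        h = [
            math.exp(delta[t] * A[i]) * h[i] + delta[t] * B[t][i] * x[t]
            for i in range(n)
        ]
        # Reduction: the output is the dot product of C_t with the state.
        y.append(sum(C[t][i] * h[i] for i in range(n)))
    return y


if __name__ == "__main__":
    # With A = 0 and unit delta, B, C, the scan reduces to a running sum.
    print(ssm_scan([1.0, 2.0, 3.0], [1.0] * 3, [0.0], [[1.0]] * 3, [[1.0]] * 3))
```

Each loop iteration is dominated by element-wise operations and an exponential, with only a small reduction at the end, which is consistent with the Figure 1 observation that element-wise operations take a large share of the runtime at long sequence lengths.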