Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration

Robin Geens, Arne Symons, Marian Verhelst

TL;DR

State Space Models (SSMs) offer linear-time, constant-memory inference for long sequences, but hardware acceleration is hampered by memory-bound prefill operations. The paper introduces fine-grained operator tiling and fusion within an extended Stream modeling framework to optimize data locality and execution of SSMs, achieving up to $4.8\times$ speedups over unfused baselines and enabling memory reductions by an order of magnitude. Through roofline analyses and a memory-aware fusion strategy, the work demonstrates that SSM accelerators can shift from memory-bound to compute-bound regimes, and that fusion-aware hardware design can outperform MARCA by up to $1.78\times$ under iso-area constraints. These insights enable faster hardware exploration and provide practical guidance for designing next-generation SSM accelerators capable of handling long sequences efficiently.

Abstract

State Space Models (SSMs) offer a promising alternative to transformers for long-sequence processing. However, their efficiency remains hindered by memory-bound operations, particularly in the prefill stage. While MARCA, a recent first effort to accelerate SSMs through a dedicated hardware accelerator, achieves large speedups over high-end GPUs, an analysis of the broader accelerator design space is lacking. This work systematically analyzes SSM acceleration opportunities both from the scheduling perspective, through fine-grained operator fusion, and from the hardware perspective, through design space exploration, using an extended version of the Stream modeling framework. Our results demonstrate that the improved data locality stemming from our optimized fusion and scheduling strategy enables a speedup of up to $4.8\times$ over unfused execution, while our adaptive memory-aware fusion approach reduces on-chip memory requirements by an order of magnitude without sacrificing performance. We further explore accelerator design trade-offs, showing that a fusion-aware hardware architecture can achieve $1.78\times$ higher performance than the state-of-the-art MARCA accelerator within the same area budget. These results establish operator fusion as a key enabler for next-generation SSM accelerators.

Paper Structure

This paper contains 37 sections, 3 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Comparison of transformer-based OPT-2.7B and SSM-based Mamba-2.8B during inference for the prefill stage (left) and decode stage (right). SSMs exhibit constant operations and memory usage during the decode stage, whereas the prefill stage presents a trade-off: lower operations but higher memory requirements compared to transformers.
  • Figure 2: Overview of this paper.
  • Figure 3: Architecture overview of transformers and SSMs. The state update block is detailed in Figure \ref{fig:state-update}.
  • Figure 4: Inference latency (right) for the OPT-2.7B transformer (left bars, squares) and the Mamba-2.8B SSM (right bars, diamonds) across stages and sequence lengths. Operator distribution (left) and intensity (middle) shape the total latency. The roofline is shown for the $L\!=\!2048$ case. SSMs perform worse in the prefill stage due to memory-bound state-update operators.
  • Figure 5: Dependency tracking using emulation of complex tensor manipulations between a producing and consuming operator. At the end, the exact dependencies for every individual element in the consuming operator's input are known.
  • ...and 7 more figures
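
The roofline reasoning referenced in the Figure 4 caption, and the claim that fusion helps memory-bound operators, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the peak compute and bandwidth figures, the three-operator element-wise chain, and the helper names are hypothetical and do not come from the paper, the Stream framework, or MARCA. It only shows the general mechanism by which fusing operators raises operational intensity, and it does not reproduce the reported $4.8\times$ result.

```python
# Hypothetical roofline sketch; all numbers are illustrative assumptions,
# not values taken from the paper.

def attainable_ops_per_s(intensity, peak_ops, mem_bw):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_ops, intensity * mem_bw)

def is_memory_bound(intensity, peak_ops, mem_bw):
    """An operator is memory-bound when it sits left of the ridge point."""
    return intensity < peak_ops / mem_bw

# Assumed accelerator: 1 Tops/s peak compute, 100 GB/s off-chip bandwidth.
PEAK_OPS = 1.0e12
MEM_BW = 1.0e11

# Assumed workload: a chain of 3 element-wise operators over N elements,
# 2 bytes per element, 1 operation per element per operator.
N = 1_000_000
BYTES_PER_ELEM = 2
NUM_OPERATORS = 3
total_ops = NUM_OPERATORS * N

# Unfused: every operator reads its input from and writes its output to
# off-chip memory.
unfused_bytes = NUM_OPERATORS * 2 * N * BYTES_PER_ELEM

# Fused: intermediates stay on-chip, so only the first input and the last
# output cross the off-chip boundary.
fused_bytes = 2 * N * BYTES_PER_ELEM

unfused_intensity = total_ops / unfused_bytes  # ops per byte
fused_intensity = total_ops / fused_bytes

unfused_time = total_ops / attainable_ops_per_s(unfused_intensity, PEAK_OPS, MEM_BW)
fused_time = total_ops / attainable_ops_per_s(fused_intensity, PEAK_OPS, MEM_BW)
speedup = unfused_time / fused_time

print(f"unfused intensity: {unfused_intensity:.2f} ops/byte")
print(f"fused intensity:   {fused_intensity:.2f} ops/byte")
print(f"speedup from fusion: {speedup:.2f}x")
```

In this toy setting both variants remain memory-bound, so the speedup simply equals the reduction in off-chip traffic. With a longer fused chain or more operations per element, the fused intensity could cross the ridge point `PEAK_OPS / MEM_BW`, at which point the operator would become compute-bound and further fusion would no longer help.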