Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration
Robin Geens, Arne Symons, Marian Verhelst
TL;DR
State Space Models (SSMs) offer linear-time, constant-memory inference for long sequences, but their hardware acceleration is hampered by memory-bound operations in the prefill stage. The paper introduces fine-grained operator tiling and fusion within an extended Stream modeling framework to improve the data locality and scheduling of SSMs, achieving speedups of up to $4.8\times$ over unfused baselines and reducing on-chip memory requirements by an order of magnitude. Through roofline analyses and a memory-aware fusion strategy, the work demonstrates that SSM accelerators can shift from memory-bound to compute-bound regimes, and that a fusion-aware hardware design can outperform MARCA by up to $1.78\times$ under iso-area constraints. These insights enable faster hardware exploration and provide practical guidance for designing next-generation SSM accelerators that handle long sequences efficiently.
Abstract
State Space Models (SSMs) offer a promising alternative to transformers for long-sequence processing. However, their efficiency remains hindered by memory-bound operations, particularly in the prefill stage. While MARCA, a recent first effort to accelerate SSMs through a dedicated hardware accelerator, achieves substantial speedups over high-end GPUs, an analysis of the broader accelerator design space is lacking. This work systematically analyzes SSM acceleration opportunities from both the scheduling perspective, through fine-grained operator fusion, and the hardware perspective, through design space exploration, using an extended version of the Stream modeling framework. Our results demonstrate that the improved data locality stemming from our optimized fusion and scheduling strategy enables a speedup of up to $4.8\times$ over unfused execution, while our adaptive memory-aware fusion approach reduces on-chip memory requirements by an order of magnitude without sacrificing performance. We further explore accelerator design trade-offs, showing that a fusion-aware hardware architecture can achieve $1.78\times$ higher performance than the state-of-the-art MARCA accelerator within the same area budget. These results establish operator fusion as a key enabler for next-generation SSM accelerators.
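To illustrate why fusion helps memory-bound operations, the following minimal sketch applies the standard roofline model to a chain of element-wise operators. It is not the paper's Stream-based methodology. The hardware numbers (`PEAK_FLOPS`, `BANDWIDTH`), the three-operator chain, and all function names are hypothetical and chosen only for illustration.

```python
# Roofline sketch: attainable performance = min(peak compute, bandwidth * intensity).
# All hardware and workload numbers below are hypothetical, not taken from the paper.

PEAK_FLOPS = 1.0e12   # assumed peak compute throughput, FLOP/s
BANDWIDTH = 1.0e11    # assumed off-chip bandwidth, bytes/s


def operational_intensity(flops: float, offchip_bytes: float) -> float:
    """FLOPs performed per byte moved to or from off-chip memory."""
    return flops / offchip_bytes


def attainable_flops(intensity: float) -> float:
    """Roofline bound on performance for a given operational intensity."""
    return min(PEAK_FLOPS, BANDWIDTH * intensity)


def is_memory_bound(intensity: float) -> bool:
    """A workload is memory-bound when it sits left of the ridge point."""
    return intensity < PEAK_FLOPS / BANDWIDTH


# Assumed workload: 3 chained element-wise ops over N elements of 4 bytes each,
# with 1 FLOP per element per op.
N = 1_000_000
BYTES_PER_ELEMENT = 4
NUM_OPS = 3
total_flops = NUM_OPS * N

# Unfused: every op reads its input from and writes its output to off-chip memory.
unfused_bytes = NUM_OPS * 2 * N * BYTES_PER_ELEMENT
# Fused: only the first input and the last output cross the off-chip boundary;
# intermediates stay on-chip.
fused_bytes = 2 * N * BYTES_PER_ELEMENT

unfused_intensity = operational_intensity(total_flops, unfused_bytes)
fused_intensity = operational_intensity(total_flops, fused_bytes)

speedup = attainable_flops(fused_intensity) / attainable_flops(unfused_intensity)

print(f"unfused intensity: {unfused_intensity} FLOP/byte")
print(f"fused intensity:   {fused_intensity} FLOP/byte")
print(f"roofline speedup from fusion: {speedup}x")
```

In this toy setting both variants remain memory-bound, so the modeled speedup equals the reduction in off-chip traffic. Fusing longer operator chains, or operators with more compute per element, raises the intensity further and can move the workload past the ridge point into the compute-bound regime.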
