Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models

Toluwanimi O. Odemuyiwa, John D. Owens, Joel S. Emer, Michael Pellauer

Abstract

Mamba is an emerging, complex workload with various short-range and long-range dependencies, nonlinearities, and elementwise computations that are unable to run at near-peak speeds on modern hardware. Specifically, Mamba's complex dependency graph makes fusion across its full operator cascade difficult, so substantial inter-operator memory traffic remains. To address these challenges, we propose Mambalaya, a novel reconfigurable accelerator that leverages fusion to overcome the limitations of Mamba. We use the recently proposed cascade-of-Einsums abstraction to characterize Mamba's full computational structure, then apply the extended Einsum framework to systematically explore inter-Einsum fusion opportunities. This principled approach yields a series of fusion mappings that reduce off-chip inter-Einsum traffic. These mappings are supported by the underlying Mambalaya architecture. Mambalaya achieves a layer performance speedup of 4.9$\times$ for prefill and 1.9$\times$ for generation over MARCA. In prefill-dominated scenarios, it achieves up to 1.5$\times$ over a recent fine-grained, memory-aware fusion accelerator for Mamba.

Paper Structure

This paper contains 37 sections, 2 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: Overview of the cascade execution flow for Mamba. Rounded, rectangular boxes are tensors, with the rank names (and shapes) in superscripts, and the corresponding rank variables (tensor indices) in the subscripts [Odemuyiwa:2024:edge]. Each Einsum is labeled with a number (yellow box) on its output tensor. Colors represent the following: (a) blue: input tensor, (b) green: GEMM with a weight tensor, (c) purple: tensor with recurrent accesses (e.g., $H_{i-1}$), (d) light orange: elementwise/broadcast operation, (e) dark grey: unary and nonlinear functions.
  • Figure 2: Overall roofline plot (a) confirms that unfused operations are memory-bound, but does not give insight into Einsums' relative weighting. Plotting detailed roofline utilization over time for a single layer demonstrates that (b) unfused prefill alternates between compute-bound and memory-bound Einsums, whereas (c) decode does not have enough reuse to reach the compute-bound regime in any Einsum. Phase labels in yellow correspond to the Einsum numbers in Figure 1. Ideal fusion would eliminate all inter-Einsum traffic, resulting in significant increases to intensity, as shown in the bottom figures.
  • Figure 3: Examples of how each fusion class transforms an upstream iteration space to a downstream iteration space.
  • Figure 4: Unfused to RI fusion. Both Einsums share the same iteration space.
  • Figure 5: Unfused to RSb fusion. The upstream Einsum contains a rank ($K$) absent from the downstream.
  • ...and 10 more figures
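
The Figure 5 caption describes an upstream Einsum that contains a rank ($K$) absent from the downstream Einsum. The sketch below illustrates that situation and why fusing such a pair removes the intermediate tensor. It is a minimal illustration under stated assumptions, not the paper's mapping: the tensor names (A, B, G, Z, Y), the choice of a GEMM followed by an elementwise multiply, and the loop order are all hypothetical.

```python
# Hypothetical two-Einsum cascade (names and shapes are illustrative only):
#   upstream:   Z[m, n] = sum over k of A[m, k] * B[k, n]   (rank K is reduced away)
#   downstream: Y[m, n] = Z[m, n] * G[m, n]                  (elementwise, no K rank)

def unfused(A, B, G):
    """Run the two Einsums back to back, materializing all of Z in between."""
    M, K, N = len(A), len(B), len(B[0])
    Z = [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
         for m in range(M)]
    Y = [[Z[m][n] * G[m][n] for n in range(N)] for m in range(M)]
    return Y, M * N  # second value: intermediate elements held at once


def fused(A, B, G):
    """Consume each Z[m, n] as soon as its K reduction finishes."""
    M, K, N = len(A), len(B), len(B[0])
    Y = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            z = 0  # the only live intermediate element
            for k in range(K):
                z += A[m][k] * B[k][n]
            Y[m][n] = z * G[m][n]  # downstream Einsum applied immediately
    return Y, 1


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
G = [[1, 0], [2, 1]]
y_unfused, live_unfused = unfused(A, B, G)
y_fused, live_fused = fused(A, B, G)
```

Both versions compute the same Y. In the unfused version the full M-by-N intermediate exists between the two Einsums, whereas the fused version keeps a single element live, which is the kind of inter-Einsum traffic the abstract says fusion reduces.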