DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages

Alish Kanani, Sangwan Lee, Han Lyu, Jiahao Lin, Jaehyun Park, Umit Y. Ogras

Abstract

Large language model inference proceeds in two distinct phases: a compute-bound prefill followed by a memory-bandwidth-bound decode. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, indicating that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns the prefill and decode phases to specialized packages. The Prefill package uses systolic-array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package uses vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector-matrix multiplications. Both packages are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens than the B200 GPU.
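
The compute-bound versus bandwidth-bound split can be made concrete with a roofline estimate. The sketch below is illustrative only. It uses the B200 peak numbers quoted in the Figure 1 caption (2.25 PFLOP/s, 8 TB/s). The function names and the simplified cost model for a single weight matrix multiply (2-byte weights, activation traffic ignored) are assumptions made here, not the paper's model.

```python
# Roofline sketch. Peak numbers are those quoted for the B200 in Figure 1;
# the cost model below is a simplifying assumption for illustration.
PEAK_FLOPS = 2.25e15  # FLOP/s
PEAK_BW = 8e12        # bytes/s (HBM)


def ridge_point(peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / peak_bw


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Roofline bound: min(peak compute, intensity * bandwidth)."""
    return min(peak_flops, intensity * peak_bw)


def weight_matmul_intensity(tokens, bytes_per_weight=2):
    """Hypothetical model: a (tokens x d_in) @ (d_in x d_out) product performs
    2 * tokens * d_in * d_out FLOPs and reads d_in * d_out weights once.
    Activation traffic is ignored, so d_in and d_out cancel out."""
    return 2.0 * tokens / bytes_per_weight


def regime(tokens):
    """Classify the matmul as compute-bound or memory-bound under this model."""
    above_ridge = weight_matmul_intensity(tokens) >= ridge_point()
    return "compute-bound" if above_ridge else "memory-bound"


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.2f} FLOP/byte")
    for tokens in (1, 64, 4096):
        bound = attainable_flops(weight_matmul_intensity(tokens))
        print(f"tokens={tokens:5d}  {regime(tokens):13s}  {bound:.3e} FLOP/s")
```

Under this simplified model, a decode step that processes one token per weight read sits far below the ridge point, while a prefill over thousands of tokens sits above it, which is the asymmetry the abstract describes.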

Paper Structure (13 sections, 1 equation, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Roofline analysis of Mamba and Transformer layers from Nemotron-H-56B [blakeman2025nemotron] as a function of batch size (B) on the NVIDIA Blackwell B200 (2.25 PFLOP/s peak, 8 TB/s HBM bandwidth, 192 GB capacity) [nvidia_blackwell_datasheet_2024].
  • Figure 2: Overview of the DUET disaggregated acceleration framework. The Prefill package uses configurable systolic-array chiplets with off-package DRAM, while the Decode package employs configurable vector-unit chiplets with in-package HBM.
  • Figure 3: Overview of SSM computation [dao2024transformers] and its state-stationary dataflow mapping onto a systolic array. This operation repeats over the sequence length $L$.
  • Figure 4: Vector-unit array for the Decode package.
  • Figure 5: Design space exploration for a single SSM kernel in Nemotron-H-56B [blakeman2025nemotron]. (a) Systolic array used in the prefill phase, (b) Vector-unit array used in the decode phase.
  • ...and 1 more figures
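
The SSM computation referenced in Figure 3, which repeats over the sequence length $L$, can be sketched in a few lines. This is a minimal pure-Python illustration that assumes the single-head, scalar-decay form of the selective state space update, $h_t = a_t h_{t-1} + B_t x_t^\top$ and $y_t = C_t^\top h_t$, with any discretization factors folded into $a_t$ and $B_t$. The function names, the shapes, and the reading of "state-stationary" as keeping the state resident while per-token inputs stream past it are assumptions made here, not the paper's implementation.

```python
def ssm_step(h, a_t, B_t, C_t, x_t):
    """One token update of an assumed scalar-decay SSM head.

    h   : N x P state (list of lists), updated in place ("stationary")
    a_t : scalar decay for this token
    B_t : length-N input projection
    C_t : length-N output projection
    x_t : length-P input vector
    Returns y_t, a length-P output vector.
    """
    N, P = len(B_t), len(x_t)
    y_t = [0.0] * P
    for n in range(N):
        for p in range(P):
            # State update: h_t = a_t * h_{t-1} + B_t x_t^T
            h[n][p] = a_t * h[n][p] + B_t[n] * x_t[p]
            # Output: y_t = C_t^T h_t
            y_t[p] += C_t[n] * h[n][p]
    return y_t


def ssm_scan(a, B, C, x, N, P):
    """Prefill-style pass: apply ssm_step to each of the L tokens in order."""
    h = [[0.0] * P for _ in range(N)]  # state starts at zero
    return [ssm_step(h, a[t], B[t], C[t], x[t]) for t in range(len(x))]


if __name__ == "__main__":
    ys = ssm_scan(a=[0.5, 0.5], B=[[1.0], [1.0]], C=[[1.0], [1.0]],
                  x=[[2.0], [4.0]], N=1, P=1)
    print(ys)
```

In this sketch a prefill pass calls `ssm_scan` over all $L$ tokens, whereas a decode step would call `ssm_step` once per generated token on the carried-over state.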