Characterizing State Space Model and Hybrid Language Model Performance with Long Context

Saptarshi Mitra, Rachid Karami, Haocheng Xu, Sitao Huang, Hyoukjun Kwon

TL;DR

This work presents a comprehensive, comparative benchmarking of carefully selected Transformers, SSMs, and hybrid models for long-context inference on consumer and embedded GPUs, and shows that SSMs are well suited to on-device AI in this setting.

Abstract

Emerging applications such as AR are driving demand for machine intelligence capable of processing continuous and/or long-context inputs on local devices. However, the currently dominant models based on the Transformer architecture suffer from quadratic computational and memory overhead, which hinders applications that must process long contexts. This has spurred a paradigm shift towards new architectures like State Space Models (SSMs) and SSM-Transformer hybrid models, which provide near-linear scaling. In recent studies, this near-linear scaling enabled efficient handling of millions of tokens while delivering high performance. Although such works present promising results, their workload characteristics in terms of computational performance and hardware resource requirements are not yet thoroughly explored, which limits our understanding of their implications for system-level optimizations. To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformers, SSMs, and hybrid models specifically for long-context inference on consumer and embedded GPUs. Our analysis shows that SSMs are well suited to on-device AI on consumer and embedded GPUs for long-context inference. While Transformers are up to 1.9x faster at short sequences (<8K tokens), SSMs demonstrate a dramatic performance inversion, becoming up to 4x faster at very long contexts (~57K tokens), thanks to their linear computational complexity and ~64% reduced memory footprint. Our operator-level analysis reveals that custom SSM kernels like selective scan, despite being hardware-aware to minimize memory IO, dominate the inference runtime on edge platforms, accounting for over 55% of latency due to their sequential, element-wise nature. To foster further research, we will open-source our characterization framework.
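
The scaling contrast in the abstract can be made concrete with a back-of-the-envelope cost model. The sketch below is an illustration only, not the paper's characterization framework: the function names and every configuration value (layer count, head count, head dimension, state size) are hypothetical placeholders rather than the settings of any model evaluated in the paper. It uses the standard estimates that prefill attention costs about 4·L²·d FLOPs per head per layer (for QK^T and the attention-weighted sum over V), that a KV cache grows linearly with sequence length L, and that an SSM's recurrent state does not depend on L.

```python
def attention_prefill_flops(seq_len, n_layers, n_heads, head_dim):
    """Approximate FLOPs for QK^T plus attention-weighted V during prefill.

    Each of the two matrix products costs about 2 * L^2 * d FLOPs per head;
    causal-mask savings and all other layers (MLP, projections) are ignored.
    """
    return 4 * n_layers * n_heads * head_dim * seq_len * seq_len


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    """Bytes for the recurrent SSM state; independent of sequence length."""
    return n_layers * d_inner * d_state * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to show the growth rates.
    cfg = dict(n_layers=24, n_heads=16, n_kv_heads=2, head_dim=64)
    state = ssm_state_bytes(n_layers=24, d_inner=2048, d_state=128)
    for seq_len in (1024, 8192, 57344):
        flops = attention_prefill_flops(seq_len, cfg["n_layers"], cfg["n_heads"], cfg["head_dim"])
        cache = kv_cache_bytes(seq_len, cfg["n_layers"], cfg["n_kv_heads"], cfg["head_dim"])
        print(f"L={seq_len:>6}  attn FLOPs={flops:.3e}  KV cache={cache / 2**20:.1f} MiB  "
              f"SSM state={state / 2**20:.1f} MiB")
```

Doubling the sequence length quadruples the attention term and doubles the KV cache, while the SSM state stays fixed. The model deliberately omits everything the paper measures empirically (kernel efficiency, memory bandwidth, activation memory), so it should not be expected to reproduce the reported speedups or the ~64% memory reduction.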

Paper Structure

This paper contains 24 sections, 3 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: TTFT (a,b) and TPOT (c,d) scaling comparison of Qwen2.5-0.5B [yang2025qwen3] and Mamba2-780m [dao2024transformers]. While Qwen is faster (1.9$\times$) at shorter sequence lengths, Mamba2's superior scaling provides a significant performance advantage (2.65$\times$) at longer contexts for both the prefill and decode stages (generation length 256, batch size 1).
  • Figure 2: (a) Basic building block of Transformers: a scaled dot-product attention module; (b) an S6 block showing the fundamental computation of SSMs; (c) overview of auto-regressive generation (prefill and decode) of LLMs.
  • Figure 3: Accuracy-latency efficiency frontier analysis of Transformer (Qwen2.5), SSM (Mamba2), and Hybrid (Falcon-H1) models of similar size ($\approx$1.5B) for 57K sequence length during prefill stage.
  • Figure 4: Overall flow of the characterization framework for language models
  • Figure 5: Memory footprint of the prefill stage for Transformer, SSM, and Hybrid models on (a) consumer GPU (RTX 4090) and (b) edge GPU (Jetson Orin Nano)
  • ...and 4 more figures
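
Figure 1 reports TTFT (time to first token) and TPOT (time per output token). As a minimal sketch of how these two metrics are commonly defined, assuming a generator that yields one token at a time: TTFT is the delay from the request until the first token arrives (dominated by prefill), and TPOT is the average gap between subsequent tokens (decode). The helper name `measure_ttft_tpot` and the injectable `clock` parameter are hypothetical and introduced only for illustration; the paper's own measurement procedure may differ.

```python
import time


def measure_ttft_tpot(token_stream, clock=time.perf_counter):
    """Return (ttft, tpot) in seconds for an iterable that yields tokens one by one.

    ttft: time from the start of the call until the first token is received.
    tpot: mean time between consecutive tokens after the first (0.0 if only one).
    """
    start = clock()
    first = last = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first is None:
            first = now  # arrival of the first token marks the end of prefill
        last = now
        count += 1
    if count == 0:
        raise ValueError("token_stream produced no tokens")
    ttft = first - start
    tpot = (last - first) / (count - 1) if count > 1 else 0.0
    return ttft, tpot


if __name__ == "__main__":
    def fake_generation(n_tokens=5):
        # Stand-in for a model: a slow "prefill" followed by faster "decode" steps.
        time.sleep(0.05)
        yield 0
        for i in range(1, n_tokens):
            time.sleep(0.01)
            yield i

    ttft, tpot = measure_ttft_tpot(fake_generation())
    print(f"TTFT = {ttft * 1e3:.1f} ms, TPOT = {tpot * 1e3:.1f} ms")
```

On a GPU, wall-clock timestamps like these are only meaningful if the device has finished its queued work when each timestamp is taken, so a real harness would need to synchronize before reading the clock or use device-side timing.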