Quamba: A Post-Training Quantization Recipe for Selective State Space Models

Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu

TL;DR

Quamba targets the deployment of selective State Space Models (SSMs) on resource-constrained hardware through post-training quantization (PTQ) to $8$ bits, addressing two properties specific to SSMs: inputs that are highly sensitive to quantization error and outputs with extreme outliers. It introduces a specialized PTQ approach that clips SSM inputs at a percentile of their calibrated range and applies a Walsh–Hadamard transform to SSM outputs so that they are quantized in an outlier-free space. Weights and activations are both quantized, and the Hadamard operations are fused into the final projection. The results show speedups of up to $1.72\times$ on edge hardware with only about a $0.9\%$ drop in zero-shot accuracy, and Pareto-optimal latency-accuracy trade-offs across cloud and edge platforms, including large-scale models like Jamba. Ablation studies validate the necessity of both percentile-based input clamping and Hadamard-based output smoothing, and the work demonstrates that low-bit quantized SSM-based models are practical for on-device and edge-cloud scenarios.

Abstract

State Space Models (SSMs) have emerged as an appealing alternative to Transformers for large language models, achieving state-of-the-art accuracy with constant memory complexity which allows for holding longer context lengths than attention-based networks. The superior computational efficiency of SSMs in long sequence modeling positions them favorably over Transformers in many scenarios. However, improving the efficiency of SSMs on request-intensive cloud-serving and resource-limited edge applications is still a formidable task. SSM quantization is a possible solution to this problem, making SSMs more suitable for wide deployment, while still maintaining their accuracy. Quantization is a common technique to reduce the model size and to utilize the low bit-width acceleration features on modern computing units, yet existing quantization techniques are poorly suited for SSMs. Most notably, SSMs have highly sensitive feature maps within the selective scan mechanism (i.e., linear recurrence) and massive outliers in the output activations which are not present in the output of token-mixing in the self-attention modules. To address this issue, we propose a static 8-bit per-tensor SSM quantization method which suppresses the maximum values of the input activations to the selective SSM for finer quantization precision and quantizes the output activations in an outlier-free space with Hadamard transform. Our 8-bit weight-activation quantized Mamba 2.8B SSM benefits from hardware acceleration and achieves a 1.72x lower generation latency on an Nvidia Orin Nano 8G, with only a 0.9% drop in average accuracy on zero-shot tasks. The experiments demonstrate the effectiveness and practical applicability of our approach for deploying SSM-based models of all sizes on both cloud and edge platforms.
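The input-side idea in the abstract, suppressing the maximum values of the SSM input so that the 8-bit grid is finer, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names `static_scale` and `fake_quantize`, the synthetic data, and the 99.9th percentile are all assumptions chosen for illustration.

```python
import numpy as np

def static_scale(calib: np.ndarray, percentile: float = 100.0) -> float:
    """Symmetric per-tensor int8 scale computed once from calibration data.

    percentile=100 is plain abs-max calibration; a smaller value clips
    the largest magnitudes so the bulk of the values gets a finer step.
    (Hypothetical helper, not part of any library.)
    """
    clip = np.percentile(np.abs(calib), percentile)
    return float(clip) / 127.0

def fake_quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Quantize to the int8 grid and map back to floats to measure error."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

# Synthetic stand-in for an SSM input: numerically small values
# plus a handful of large ones that dominate the abs-max.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.1, size=100_000)
x[:10] = 8.0

scale_max = static_scale(x, 100.0)  # step size set by the largest value
scale_pct = static_scale(x, 99.9)   # illustrative percentile, an assumption

def bulk_error(scale: float) -> float:
    """Mean squared quantization error on the non-outlier values."""
    bulk = x[10:]
    return float(np.mean((fake_quantize(bulk, scale) - bulk) ** 2))

err_max = bulk_error(scale_max)
err_pct = bulk_error(scale_pct)
print(f"abs-max scale={scale_max:.5f}  bulk MSE={err_max:.3e}")
print(f"clipped scale={scale_pct:.5f}  bulk MSE={err_pct:.3e}")
```

In this toy setting the clipped scale is much smaller than the abs-max scale, so the small values that make up almost all of the tensor are represented more precisely, at the cost of saturating the few largest entries.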

Paper Structure

This paper contains 65 sections, 2 theorems, 9 equations, 16 figures, 11 tables.

Key Result

Theorem 4.1

The quantization error $\Delta(t) = \overline{h}(t) - h(t)$ at time step $t$ for the given discrete linear time-invariant model is bounded such that $\|\Delta(t)\|_2 \leq \epsilon b \left(\frac{1}{1 - ae^{t-T}}\right)$. Consequently, the global quantization error (i.e., $t=T$) is bounded by $\|\Delta(T)\|$ …
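As a quick check of what the bound in Theorem 4.1 implies at the final step, one can substitute $t = T$ into the stated inequality. The symbols $\epsilon$, $a$, and $b$ are not defined in this excerpt; the sketch below only assumes $0 \leq a < 1$, so that the denominator stays positive.

```latex
\|\Delta(T)\|_2
  \;\leq\; \epsilon b \cdot \frac{1}{1 - a e^{T-T}}
  \;=\; \epsilon b \cdot \frac{1}{1 - a e^{0}}
  \;=\; \frac{\epsilon b}{1 - a}
```

Under the same assumption, $e^{t-T} \leq 1$ for every $t \leq T$, so the right-hand side of the per-step bound is largest at $t = T$ and the value above also bounds every earlier step.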

Figures (16)

  • Figure 1: Panel (a) shows that our method achieves Pareto-optimality on the Nano 8G with 1K input tokens. Panel (b) shows latency speedups for long input sequences on the A5000, and panel (c) shows memory usage across devices compared with Pythia 2.8B (Biderman et al., 2023) and 4-bit Llama-2-7B (Touvron et al., 2023).
  • Figure 2: We analyze the sensitivity to quantization errors of (b) self-attention layers and (c) SSM input activations. Our study shows that quantizing the $x$ tensor causes large errors at the output $y$ because of the causal relationship of the linear recurrence, which is unique to SSMs. Self-attention layers are more robust to quantization errors. Our method (d) reduces the quantization error for the input sample. In Figure \ref{fig:mamba_transformer}, we highlight the smooth, outlier, and sensitive paths in SSMs and self-attention layers.
  • Figure 3: The primary difficulties in quantizing Mamba blocks lie in the precision of the activations entering and leaving the selective SSM. Although the inputs are numerically small, the quantization step is skewed by the maximum value, causing significant errors in the SSM outputs after the linear recurrence. In contrast, large outliers are observed in the outputs. We use Hadamard matrices to transform the outputs to an outlier-free space.
  • Figure 4: The precision mapping and dataflow of Quamba. All scaling factors are fused in the quantized operations. Element-wise operations like non-linearity and residual addition are also fused into these operations.
  • Figure 5: Pareto front analysis of accuracy vs. latency on the A5000 and Nano. Quamba models lie on the Pareto front for average accuracy and latency when compared with other SSM and transformer-based LLMs, while also having a lower memory footprint, indicated by the size of each circle.
  • ...and 11 more figures
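The output-side idea described in the Figure 3 caption, moving SSM outputs into an outlier-free space with a Hadamard matrix and folding the inverse transform into the following projection, can be sketched in NumPy as follows. This is a toy illustration under stated assumptions rather than the paper's kernels: `hadamard`, `fake_quant`, the tensor shapes, and the synthetic outlier channel are all hypothetical.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix (Sylvester construction).

    n must be a power of two. Hypothetical helper for this sketch.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(t: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor int8 fake quantization with an abs-max scale."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale) * scale

rng = np.random.default_rng(0)
d = 64
y = rng.normal(size=(16, d))   # stand-in for SSM outputs (tokens x channels)
y[:, 3] += 100.0               # one synthetic outlier channel
W = rng.normal(size=(d, 32))   # stand-in for the output projection weights

H = hadamard(d)
y_h = y @ H                    # rotate activations; outlier energy is spread out
W_fused = H.T @ W              # fold the inverse rotation into the weights

out_ref = y @ W
out_fused = y_h @ W_fused      # equals out_ref because H @ H.T is the identity

# Compare int8 activation quantization with and without the rotation.
err_direct = np.linalg.norm(fake_quant(y) @ W - out_ref)
err_hadamard = np.linalg.norm(fake_quant(y_h) @ W_fused - out_ref)

print("max |y|   =", np.abs(y).max())
print("max |y H| =", np.abs(y_h).max())
print("projection error, direct   :", err_direct)
print("projection error, Hadamard :", err_hadamard)
```

Because the normalized Hadamard matrix is orthogonal, the rotation does not change the full-precision result of the projection. It only changes the space in which the activations are quantized, which is why the inverse can be merged into the weights ahead of time.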

Theorems & Definitions (3)

  • Theorem 4.1
  • Theorem N.1
  • proof