Mamba-PTQ: Outlier Channels in Recurrent Large Language Models

Alessandro Pierro; Steven Abreu

TL;DR

This work addresses the challenge of post-training quantization for recurrent LLMs, focusing on the Mamba family to enable edge deployment. It demonstrates that activation outliers, akin to those observed in transformer models, complicate naive quantization of state-space based LLMs and motivates outlier-aware techniques. The authors outline a baseline quantization approach for Mamba and adapt SmoothQuant to create an outlier-aware variant that uses a per-channel smoothing mechanism to migrate quantization difficulty between activations and weights. Across six downstream tasks and multiple model sizes, the study highlights the impact of outliers on accuracy and establishes a concrete direction for hardware-friendly, outlier-aware quantization of recurrent LLMs with potential applicability to other SSM-based architectures.
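
The per-channel smoothing idea can be illustrated with a minimal sketch. It assumes the standard SmoothQuant formulation, where each input channel j gets a scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); it is not the authors' implementation. The toy matrices, the function names, and the choice alpha = 0.5 are hypothetical. Dividing the activations by s_j and multiplying the matching weight rows by s_j leaves the layer output unchanged while shrinking the activation range.

```python
def smoothing_scales(X, W, alpha=0.5, eps=1e-8):
    """Per-input-channel scales s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j, w_row in enumerate(W):
        # eps guards against all-zero channels (an assumption of this sketch).
        act_max = max(max(abs(row[j]) for row in X), eps)
        w_max = max(max(abs(w) for w in w_row), eps)
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in w_row] for w_row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def abs_max(M):
    return max(abs(v) for row in M for v in row)


# Hypothetical toy data: 2 tokens x 3 channels; channel 1 is an outlier channel.
X = [[0.5, -40.0, 0.2], [-0.3, 35.0, 0.4]]
W = [[0.2, -0.1], [0.05, 0.1], [-0.3, 0.25]]  # 3 input channels x 2 outputs

scales = smoothing_scales(X, W, alpha=0.5)
X_s, W_s = apply_smoothing(X, W, scales)

# The product is mathematically unchanged; only the value ranges move.
Y, Y_s = matmul(X, W), matmul(X_s, W_s)
print("activation abs-max before/after:", abs_max(X), abs_max(X_s))
print("weight abs-max before/after:", abs_max(W), abs_max(W_s))
```

In this sketch the activation range shrinks while the weight range grows, which is the "migration" of quantization difficulty: alpha controls how much of it moves from activations to weights.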

Abstract

Modern recurrent layers are emerging as a promising path toward edge deployment of foundation models, especially in the context of large language models (LLMs). Compressing the whole input sequence into a finite-dimensional representation enables recurrent layers to model long-range dependencies while maintaining a constant inference cost for each token and a fixed memory requirement. However, the practical deployment of LLMs in resource-limited environments often requires further model compression, such as quantization and pruning. While these techniques are well-established for attention-based models, their effects on recurrent layers remain underexplored. In this preliminary work, we focus on post-training quantization for recurrent LLMs and show that Mamba models exhibit the same pattern of outlier channels observed in attention-based LLMs. We show that the difficulty of quantizing SSMs stems from activation outliers, similar to those observed in transformer-based LLMs. We report baseline results for post-training quantization of Mamba that do not take the activation outliers into account, and suggest first steps toward outlier-aware quantization.
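
To see why a single outlier channel hurts naive quantization, consider the following minimal sketch. It assumes symmetric per-tensor absmax quantization to int8, a common baseline that is not necessarily the exact scheme used in the paper. The activation values and function names are hypothetical. One large value stretches the shared scale, so the remaining values collapse onto only a few integer codes.

```python
def quantize_absmax(values, n_bits=8):
    """Symmetric per-tensor quantization: one scale shared by the whole tensor."""
    q_max = 2 ** (n_bits - 1) - 1                      # 127 for int8
    scale = max(abs(v) for v in values) / q_max or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [q * scale for q in codes]


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


regular = [0.11, -0.27, 0.34, 0.05]  # hypothetical values from regular channels
outlier = 60.0                        # hypothetical value from an outlier channel

# Quantize the regular values on their own.
codes, scale = quantize_absmax(regular)
err_without = mean_abs_error(regular, dequantize(codes, scale))

# Quantize them together with the outlier, then measure error on the regular values only.
codes_o, scale_o = quantize_absmax(regular + [outlier])
err_with = mean_abs_error(regular, dequantize(codes_o[:-1], scale_o))

print("scale without/with outlier:", scale, scale_o)
print("mean abs error without/with outlier:", err_without, err_with)
```

Outlier-aware methods target exactly this effect: they either isolate the outlier channels or rescale them so that the shared quantization range is not dominated by a handful of values.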

Paper Structure (13 sections, 6 equations, 2 figures, 3 tables)

Introduction
Quantization and outlier channels in LLMs
Method
Mamba model
Baseline quantization
Outlier-aware quantization (e.g., SmoothQuant)
Experiments
Experimental setup
Discussion
Future work
Additional results
Impact of removing outlier channels
Impact of quantization on downstream task accuracy

Figures (2)

Figure 1: Architecture diagram of the Mamba block and details on the absolute maximum activation (y-axis) across channels (x-axis), measured on a subset of WikiText-2 (Merity et al., 2016) for Mamba-130m. Shaded regions account for six standard deviations.
Figure 2: Average one-shot accuracy on downstream tasks across model sizes for Mamba with different quantization configurations. The accuracy is averaged over all tasks listed in the downstream task results table.
