Mamba-PTQ: Outlier Channels in Recurrent Large Language Models
Alessandro Pierro, Steven Abreu
TL;DR
This work addresses the challenge of post-training quantization for recurrent LLMs, focusing on the Mamba family to enable edge deployment. It demonstrates that activation outliers, akin to those observed in transformer models, complicate naive quantization of state-space-based LLMs and motivate outlier-aware techniques. The authors outline a baseline quantization approach for Mamba and adapt SmoothQuant to create an outlier-aware variant that uses per-channel smoothing to migrate quantization difficulty from activations to weights. Across six downstream tasks and multiple model sizes, the study highlights the impact of outliers on accuracy and sets a direction for hardware-friendly, outlier-aware quantization of recurrent LLMs, with potential applicability to other SSM-based architectures.
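The per-channel smoothing idea can be illustrated with a minimal NumPy sketch. It follows the general SmoothQuant formulation, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), applied to a single linear layer; it is not the authors' Mamba-specific adaptation. The function name `smooth_scales`, the toy shapes, the injected outlier channel, and the value alpha = 0.5 are all assumptions made for illustration.

```python
import numpy as np

def smooth_scales(act_absmax, weight, alpha=0.5, eps=1e-8):
    """Per-input-channel smoothing factors (hypothetical helper).

    act_absmax: (in_features,) max |activation| per channel, from calibration data.
    weight:     (in_features, out_features) weight matrix of the linear layer.
    """
    w_absmax = np.abs(weight).max(axis=1)  # max |W| per input channel
    num = np.power(np.maximum(act_absmax, eps), alpha)
    den = np.power(np.maximum(w_absmax, eps), 1.0 - alpha)
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))   # toy activations: (tokens, channels)
X[:, 3] *= 50.0                # assumed outlier channel, for illustration only
W = rng.normal(size=(8, 4))    # toy weights: (in_features, out_features)

s = smooth_scales(np.abs(X).max(axis=0), W, alpha=0.5)

# Divide activations by s and multiply the matching weight rows by s.
# The product X @ W is unchanged, but the activation range shrinks.
X_smooth = X / s
W_smooth = W * s[:, None]
```

Because the scaling cancels inside the matrix product, the layer output is mathematically identical before quantization; only the dynamic range is redistributed between the two operands.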
Abstract
Modern recurrent layers are emerging as a promising path toward edge deployment of foundation models, especially in the context of large language models (LLMs). Compressing the whole input sequence into a finite-dimensional representation enables recurrent layers to model long-range dependencies while maintaining a constant inference cost for each token and a fixed memory requirement. However, the practical deployment of LLMs in resource-limited environments often requires further model compression, such as quantization and pruning. While these techniques are well-established for attention-based models, their effects on recurrent layers remain underexplored. In this preliminary work, we focus on post-training quantization for recurrent LLMs and show that Mamba models exhibit the same pattern of outlier channels observed in attention-based LLMs. We show that the difficulty of quantizing SSMs stems from activation outliers, similar to those observed in transformer-based LLMs. We report baseline results for post-training quantization of Mamba that do not take the activation outliers into account and suggest first steps toward outlier-aware quantization.
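To make the outlier problem concrete, the following sketch shows how a single large-magnitude channel degrades naive quantization of all other channels. It uses symmetric per-tensor absmax quantization to 8 bits on synthetic data; the quantizer, the tensor shapes, and the 100x outlier scale are assumptions chosen for illustration and do not reproduce the paper's baseline setup or results.

```python
import numpy as np

def fake_quant_int8(x):
    """Symmetric per-tensor absmax quantization to int8, then dequantization."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 16))   # synthetic activations: (tokens, channels)
acts_outlier = acts.copy()
acts_outlier[:, 5] *= 100.0         # assumed outlier channel

# Measure the error only on the channels that are NOT outliers.
normal = [c for c in range(16) if c != 5]
err_clean = np.abs(fake_quant_int8(acts) - acts)[:, normal].mean()
err_outlier = np.abs(fake_quant_int8(acts_outlier) - acts_outlier)[:, normal].mean()

print(f"mean abs error without outlier channel: {err_clean:.4f}")
print(f"mean abs error with outlier channel:    {err_outlier:.4f}")
```

The outlier channel stretches the shared quantization scale, so the remaining channels are represented with only a few integer levels and their error grows accordingly.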
