PerfMamba: Performance Analysis and Pruning of Selective State Space Models
Abdullah Al Asif, Mobina Kashaniyan, Sixing Yu, Juan Pablo Muñoz, Ali Jannesari
TL;DR
This work analyzes the runtime behavior and efficiency of selective State Space Models (SSMs), focusing on Mamba-1 and Mamba-2, and identifies the SSM update as the primary cost driver at long sequence lengths. It introduces a Δ-guided structured state pruning technique that removes low-activity state channels and adds a bridging layer to maintain compatibility, yielding up to 1.14× throughput and 11.5% memory savings with controlled accuracy loss. Through component-level profiling and long-sequence experiments, the study provides actionable guidance for hardware-software co-optimization of SSM-based architectures on real-world long-sequence tasks. The findings offer a practical path to deploying more efficient SSM-based models across modalities with minimal performance degradation, and they inform the future design and optimization of long-sequence models.
Abstract
Recent advances in sequence modeling have introduced selective SSMs as promising alternatives to Transformer architectures, offering theoretical advantages in computational efficiency and sequence processing. However, the runtime behavior, resource utilization patterns, and scaling characteristics of selective SSMs remain insufficiently understood, which hinders their optimal deployment and further architectural improvement. This paper presents a thorough empirical study of Mamba-1 and Mamba-2, systematically profiling their performance to assess the design principles that contribute to their efficiency in state-space modeling. We analyze computation patterns, memory access, I/O characteristics, and scaling properties for sequence lengths ranging from 64 to 16,384 tokens. Our findings show that the SSM component, a central part of the selective SSM architecture, accounts for a significant share of computational resources relative to the other components in the Mamba block. Based on these insights, we propose a pruning technique that selectively removes low-activity states within the SSM component, achieving measurable throughput and memory gains while maintaining accuracy within a moderate pruning regime. This approach improves performance across varying sequence lengths, achieving a 1.14× speedup and reducing memory usage by 11.50%. These results offer valuable guidance for designing more efficient SSM architectures that can be applied to a wide range of real-world applications.
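To make the pruning idea concrete, the following is a minimal NumPy sketch of activity-based channel selection, not the implementation used in this work. It assumes that a channel's activity can be scored as the mean absolute value of its Δ over a calibration batch, and that the bridging layer can be modeled as a fixed matrix that scatters the kept channels back to their original positions. The function names (`delta_activity`, `select_channels`, `bridge_matrix`), the scoring rule, and the toy shapes are all hypothetical choices made for illustration.

```python
import numpy as np

def delta_activity(delta):
    """Score each channel by its mean |delta| (assumed activity measure).

    delta has shape (batch, seq_len, channels).
    """
    return np.abs(delta).mean(axis=(0, 1))

def select_channels(activity, prune_ratio):
    """Return the sorted indices of the channels to keep."""
    n_channels = activity.shape[0]
    n_prune = int(round(prune_ratio * n_channels))
    n_keep = max(1, n_channels - n_prune)
    # argsort is ascending, so the last n_keep entries are the most active.
    keep = np.argsort(activity)[-n_keep:]
    return np.sort(keep)

def bridge_matrix(keep, n_channels):
    """Hypothetical bridging layer: an (n_keep, n_channels) matrix that
    places each kept channel back at its original index and leaves the
    pruned positions at zero."""
    bridge = np.zeros((keep.shape[0], n_channels))
    bridge[np.arange(keep.shape[0]), keep] = 1.0
    return bridge

# Toy calibration data: 8 channels, two of which are nearly inactive.
rng = np.random.default_rng(0)
delta = rng.random((2, 16, 8))
delta[..., [1, 5]] *= 0.01

keep = select_channels(delta_activity(delta), prune_ratio=0.25)

# Stand-in for the output of the pruned SSM, restored to full width so
# that downstream layers see the shape they expect.
y_pruned = rng.standard_normal((2, 16, keep.shape[0]))
y_full = y_pruned @ bridge_matrix(keep, 8)
```

In this toy setup the two scaled-down channels are the ones removed, and `y_full` has the original channel count with zeros at the pruned positions. A learned bridging projection could replace the fixed scatter matrix; the sketch uses the fixed form only because it is the simplest way to preserve shape compatibility.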
