PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training

Haoran Xu, Ziqian Liu, Rong Fu, Zhongling Su, Zerui Wang, Zheng Cai, Zhilin Pei, Xingcheng Zhang

TL;DR

The paper tackles the inefficiency of training Mamba on variable-length sequences, where padding to a maximum length wastes memory and compute and single-sequence processing leaves the GPU underutilized. It introduces PackMamba, which packs variable-length sequences into longer sequences and modifies the Conv1d and SSM operators to prevent cross-sequence state leakage, guided by Packing-Unpacking Invariance and targeted memory-optimization techniques. Based on SSM-focused operator analysis and hardware-software co-design, PackMamba reports throughput gains on an NVIDIA A100 GPU of up to 3.06x for the 1.4B model and 2.62x for the 2.8B model, with additional kernel speedups and reduced padding overhead. The work shows that re-engineering sequence-wise components, together with memory-access optimizations, can make training long-context generative models more practical, and it outlines future work toward zero padding and applicability to very long or infinite sequences.
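
As a rough illustration of the packing idea for the Conv1d operator, the sketch below concatenates sequences, tags each token with its position inside its own sequence, and skips any convolution tap that would reach past a sequence start. The function names (`pack`, `unpack`, `causal_conv1d`), the scalar per-token inputs, the kernel width of 4, and the position-index masking are assumptions made for illustration in plain Python; this is not PackMamba's actual kernel.

```python
def pack(seqs):
    """Concatenate sequences; also return each token's index within its own sequence."""
    packed, pos = [], []
    for s in seqs:
        packed.extend(s)
        pos.extend(range(len(s)))
    return packed, pos


def unpack(ys, lengths):
    """Split a packed output back into per-sequence outputs."""
    out, offset = [], 0
    for n in lengths:
        out.append(ys[offset:offset + n])
        offset += n
    return out


def causal_conv1d(xs, weights, pos=None):
    """y[t] = sum_k weights[k] * x[t - k]; taps before the sequence start count as zero.

    With pos=None the whole input is treated as one sequence. With pos given,
    a tap is used only if it stays inside the token's own sequence.
    """
    ys = []
    for t in range(len(xs)):
        reach = t if pos is None else pos[t]  # how many earlier tokens are visible
        acc = 0.0
        for k, w in enumerate(weights):
            if k <= reach:
                acc += w * xs[t - k]
        ys.append(acc)
    return ys


seqs = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
weights = [0.5, 0.25, 0.125, 0.0625]  # illustrative values only
lengths = [len(s) for s in seqs]
packed, pos = pack(seqs)

separate = [causal_conv1d(s, weights) for s in seqs]            # one sequence at a time
masked = unpack(causal_conv1d(packed, weights, pos), lengths)   # packed, boundary-aware
naive = unpack(causal_conv1d(packed, weights), lengths)         # packed, no masking

print("masked matches per-sequence result:", masked == separate)
print("naive matches per-sequence result:", naive == separate)
```

The comparison between `masked` and `separate` is one way to read Packing-Unpacking Invariance: packing, applying the operator, and unpacking should give the same result as applying the operator to each sequence on its own. The unmasked variant fails that check because the first tokens of later sequences see the tail of the previous one.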

Abstract

With the evolution of large language models, traditional Transformer models become computationally demanding for long sequences because computation grows quadratically with sequence length. Mamba, a groundbreaking architecture in generative AI, handles long sequences with reduced computational and memory complexity. Nevertheless, the existing Mamba training framework is inefficient with variable-length sequence inputs: single-sequence training results in low GPU utilization, while batching variable-length sequences padded to a maximum length incurs considerable memory and computational overhead. To address this problem, we analyze the performance of bottleneck operators in Mamba under diverse tensor shapes and propose PackMamba, a high-throughput Mamba that efficiently handles variable-length sequences. Diving deep into state-space models (SSMs), we modify the parallel operators to avoid passing information between individual sequences while maintaining high performance. Experimental results on an NVIDIA A100 GPU demonstrate throughput exceeding the baseline single-sequence processing scheme: a 3.06x speedup on the 1.4B model and 2.62x on the 2.8B model.
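
To make "avoid passing information between individual sequences" concrete for the SSM operator, here is a minimal sketch of a scalar, input-dependent recurrence run over a packed sequence, with the hidden state reset wherever a new sequence begins. Everything in it is an assumption for illustration: the names `pack`, `unpack`, and `selective_scan`, the scalar state, the constant `A = -0.5`, the softplus step size, and the sequential Python loop, which stands in for the parallel scan used in practice.

```python
import math


def pack(seqs):
    """Concatenate sequences; also return each token's index within its own sequence."""
    packed, pos = [], []
    for s in seqs:
        packed.extend(s)
        pos.extend(range(len(s)))
    return packed, pos


def unpack(ys, lengths):
    """Split a packed output back into per-sequence outputs."""
    out, offset = [], 0
    for n in lengths:
        out.append(ys[offset:offset + n])
        offset += n
    return out


def selective_scan(xs, pos=None):
    """Scalar recurrence h[t] = a[t] * h[t-1] + delta[t] * x[t], returning every h[t].

    a[t] = exp(delta[t] * A) and delta[t] = softplus(x[t]) depend on the input.
    If pos is given, the state is reset to zero at each sequence start (pos[t] == 0).
    """
    A = -0.5  # illustrative constant, not taken from the paper
    h = 0.0
    ys = []
    for t, x in enumerate(xs):
        if pos is not None and pos[t] == 0:
            h = 0.0  # drop the state carried over from the previous sequence
        delta = math.log1p(math.exp(x))  # softplus
        h = math.exp(delta * A) * h + delta * x
        ys.append(h)
    return ys


seqs = [[1.0, -2.0, 0.5], [0.3], [2.0, 1.0]]
lengths = [len(s) for s in seqs]
packed, pos = pack(seqs)

reference = [selective_scan(s) for s in seqs]             # single-sequence processing
isolated = unpack(selective_scan(packed, pos), lengths)   # packed with state reset
leaky = unpack(selective_scan(packed), lengths)           # packed without reset

print("reset scan matches per-sequence scan:", isolated == reference)
print("plain scan matches per-sequence scan:", leaky == reference)
```

Resetting the state at a boundary is equivalent to forcing the decay factor `a[t]` to zero on the first token of each sequence, which is why the packed result can match single-sequence processing while all tokens still share one long scan.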

Paper Structure

This paper contains 12 sections, 3 equations, 6 figures, 2 algorithms.

Figures (6)

  • Figure 1: PackMamba overview
  • Figure 2: SSM profiling
  • Figure 3: Mamba sequence-wise operators
  • Figure 4: Memory Access Optimization
  • Figure 5: Training Throughput Comparison
  • ...and 1 more figure