Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference

Tong Wu, Yutong He, Bin Wang, Kun Yuan

TL;DR

The paper addresses the activation-memory bottleneck in large language models. It profiles activation memory when FlashAttention is used and introduces Mixture-of-Channels (MoC), an FFN architecture that activates only the Top-K channels per token, selected by SwiGLU's native gating. MoC reduces activation memory during pre-training and speeds up decoding by loading only the relevant weights, aided by hardware-aware kernels and gradient checkpointing. On LLaMA-family and other models, MoC achieves substantial memory savings with competitive perplexity and end-to-end inference speedups of around 1.13×, while preserving model fidelity. The approach is orthogonal to attention optimizations such as FlashAttention, can be combined with mixed precision and other memory-saving techniques, and may be integrated with MoE in future work.

Abstract

Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks, driven by scaling laws that correlate model size and training data with performance improvements. However, this scaling paradigm incurs substantial memory overhead, creating significant challenges for both training and inference. While existing research has primarily addressed parameter and optimizer state memory reduction, activation memory, particularly from feed-forward networks (FFNs), has become the critical bottleneck, especially when FlashAttention is implemented. In this work, we conduct a detailed memory profiling of LLMs and identify FFN activations as the predominant source of activation memory overhead. Motivated by this, we introduce Mixture-of-Channels (MoC), a novel FFN architecture that selectively activates only the Top-K most relevant channels per token, as determined by SwiGLU's native gating mechanism. MoC substantially reduces activation memory during pre-training and improves inference efficiency by reducing memory access through partial weight loading into GPU SRAM. Extensive experiments validate that MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
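
The Top-K channel selection described in the abstract can be illustrated with a minimal single-token sketch in plain Python. This is not the authors' implementation: the function name `moc_ffn_token`, the helper `matvec`, the weight layout (one row per FFN channel in `W_gate`, `W_up`, and `W_down`), and the toy numbers are all assumptions made for illustration. The sketch also assumes that channels are ranked by the raw value of the gate-projection output and that the gate projection itself is computed densely.

```python
import math

def silu(z):
    # SiLU(z) = z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    # Each row of W is one output channel.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def moc_ffn_token(x, W_gate, W_up, W_down, k):
    """Hypothetical MoC-style FFN for one token.

    W_gate, W_up: d_ffn rows of length d_model.
    W_down: d_ffn rows of length d_model (row c is channel c's output direction).
    Only the k channels with the largest gate outputs contribute.
    """
    g = matvec(W_gate, x)  # gate projection (dense in this sketch)
    # Assumption: rank channels by the raw gate value, largest first.
    top = sorted(range(len(g)), key=lambda c: g[c], reverse=True)[:k]
    y = [0.0] * len(W_down[0])
    for c in top:
        # Only row c of W_up and W_down is read for an active channel.
        u_c = sum(w * xi for w, xi in zip(W_up[c], x))
        h_c = silu(g[c]) * u_c
        for j, w in enumerate(W_down[c]):
            y[j] += h_c * w
    return y, sorted(top)

# Toy example with d_model = 2 and d_ffn = 4 (made-up weights).
x = [1.0, -2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 0.5]]
W_up = [[0.5, 0.0], [1.0, 1.0], [0.0, 2.0], [1.0, 0.0]]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]

y_sparse, active = moc_ffn_token(x, W_gate, W_up, W_down, k=2)
print(active, y_sparse)
```

With `k` equal to the number of channels, the sketch reduces to a standard SwiGLU FFN. With smaller `k`, only `k` rows of `W_up` and `W_down` are touched per token, which is the intuition behind the partial weight loading mentioned above.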

Paper Structure

This paper contains 23 sections, 1 theorem, 13 equations, 6 figures, 20 tables.

Key Result

Theorem 1

For all $a \le b$ and $d_{\mathrm{ffn}}\in\mathbb{N}^*$, it holds that (the proof is in the appendix)

Figures (6)

  • Figure 1: Memory breakdown of pre-training LLaMA-2 with a fixed sequence length of 256 and various batch size choices.
  • Figure 2: An illustration of the Mixture-of-Channels (MoC) architecture and its modification to the standard SwiGLU FFN. The output of the gate projection is filtered by a Top-$K$ operator. Here, $U'$ and $G'$ denote the sparsified versions of $U$ and $G$, respectively. In MoC, components painted in blue are stored as activations during the forward pass, and those painted in yellow are efficiently recomputed during backpropagation.
  • Figure 3: LLaMA-2 architecture.
  • Figure 4: SiLU activation.
  • Figure 5: Histograms of pre-SiLU and post-SiLU activations from different layers of LLaMA-2. Subfigures (a), (b), and (c) correspond to the pre-SiLU activations, while subfigures (d), (e), and (f) show the post-SiLU activations. The blue dashed line marks the threshold for the top 30% of activations by value, and the red curve represents the cumulative distribution.
  • ...and 1 more figure
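
The SiLU activation (Figure 4) and the top-30% threshold marked in Figure 5 can be sketched as follows. The helper name `top_fraction_threshold` and the sample pre-activation values are made up for illustration and are not taken from the paper; the threshold is taken here to be the smallest value among the top 30% of entries ranked by value.

```python
import math

def silu(z):
    # SiLU(z) = z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def top_fraction_threshold(values, fraction=0.3):
    """Smallest value among the top `fraction` of entries, ranked by value."""
    k = max(1, round(fraction * len(values)))
    return sorted(values, reverse=True)[k - 1]

# Made-up pre-SiLU activations for one token (not data from the paper).
pre = [-3.0, -1.5, -0.5, 0.0, 0.2, 0.8, 1.0, 1.7, 2.5, 4.0]
post = [silu(z) for z in pre]

thr_pre = top_fraction_threshold(pre)    # cut-off on pre-SiLU values
thr_post = top_fraction_threshold(post)  # cut-off on post-SiLU values
print(thr_pre, thr_post)
```

Because SiLU is bounded below and close to zero for strongly negative inputs, the post-SiLU values cluster near zero while a few large entries remain; a cut-off of this kind separates those few entries from the rest.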

Theorems & Definitions (2)

  • Theorem 1
  • Proof: Proof of Theorem 1