Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts

Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong

TL;DR

Finedeep addresses sparse activation in dense LLMs by partitioning FFNs into fine-grained experts arranged across multiple sub-layers, and by weighting each expert's contribution with an output-guided sigmoid routing mechanism. The approach keeps the parameter count comparable to the dense baseline while increasing activation utilization, improves perplexity and benchmark scores across model scales, and shows that a balanced depth-width configuration performs best. Key innovations include partitioning each FFN into $M\times K$ experts, multi-layer expert arrangements, and sigmoid-based routing with RMSNorm residuals, along with an NSAR-based confirmation of reduced sparsity. The results suggest practical benefits for dense transformer deployment by expanding representational capacity without a parameter explosion, with implications for scalable, high-performing language models.

Abstract

Large language models have demonstrated exceptional performance across a wide range of tasks. However, dense models usually suffer from sparse activation, where many activation values tend towards zero (i.e., being inactivated). We argue that this could restrict the efficient exploration of model representation space. To mitigate this issue, we propose Finedeep, a deep-layered fine-grained expert architecture for dense models. Our framework partitions the feed-forward neural network layers of traditional dense models into small experts and arranges them across multiple sub-layers. A novel routing mechanism is proposed to determine each expert's contribution. We conduct extensive experiments across various model sizes, demonstrating that our approach significantly outperforms traditional dense architectures in terms of perplexity and benchmark performance while maintaining a comparable number of parameters and floating-point operations. Moreover, we find that Finedeep achieves optimal results when balancing depth and width, specifically by adjusting the number of expert sub-layers and the number of experts per sub-layer. Empirical results confirm that Finedeep effectively alleviates sparse activation and efficiently utilizes representation capacity in dense models.
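
To make "sparse activation" concrete, the following minimal sketch measures the share of SiLU outputs whose magnitude falls below a small threshold. It is an illustration only: the Gaussian pre-activations, the helper names `silu` and `near_zero_fraction`, and the threshold of 0.1 (borrowed from the subscript of $\mathrm{NSAR}_{0.1}$) are assumptions, and the function is not the paper's NSAR metric, whose definition is not given in this summary.

```python
import math
import random


def silu(x: float) -> float:
    """SiLU activation, x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def near_zero_fraction(values, threshold=0.1):
    """Share of values whose magnitude is strictly below `threshold` (hypothetical helper)."""
    values = list(values)
    return sum(abs(v) < threshold for v in values) / len(values)


# Assumption: synthetic standard-normal pre-activations stand in for real model data.
rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
activations = [silu(x) for x in pre_activations]

fraction = near_zero_fraction(activations)
print(f"share of |SiLU(x)| < 0.1: {fraction:.3f}")
```

Applied to activations collected from a real model rather than synthetic samples, the same measurement would give a rough view of how much of the activation mass sits near zero.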

Paper Structure

This paper contains 22 sections, 17 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Distribution of activation function outputs across various models (touvron2023llama; dubey2024llama; yang2024qwen2; abdin2024phi), where all selected models use the SiLU activation function. The horizontal axis represents the activation values, while the vertical axis denotes the distribution of activation values across different models.
  • Figure 2: Illustration of the proposed Finedeep. Subfigure (a) illustrates the structure of the original dense model. Subfigure (b) demonstrates the structure of our proposed Finedeep model. Each FFN in the dense model is partitioned into $M\times K$ experts distributed along $M$ sub-layers with $K$ experts per sub-layer. The connection between subfigures (a) and (b) represents the transformation process from the original dense model to the Finedeep model.
  • Figure 3: Output distributions of the activation functions for Finedeep and the baseline model.
  • Figure 4: Variation of $\mathrm{NSAR}_{0.1}$ metrics across different model layers.
  • Figure 5: t-SNE clustering of activation values from different layers in the traditional dense model and the model trained with the Finedeep method.
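
The $M\times K$ partition described in the Figure 2 caption can be sketched as simple parameter accounting. This is a hedged illustration, not the authors' implementation: it assumes the FFN is split evenly along its intermediate dimension, that each FFN holds three weight matrices (as in a gated SiLU FFN), and that routing parameters are left out of the count. The function name `partition_ffn` and the example sizes are hypothetical.

```python
def partition_ffn(d_model, d_ff, num_sublayers, experts_per_sublayer, matrices_per_ffn=3):
    """Split one dense FFN into M x K equally sized experts and count parameters.

    Assumptions (not taken from the paper): the split is along the intermediate
    dimension d_ff, and every FFN or expert holds `matrices_per_ffn` matrices.
    """
    total_experts = num_sublayers * experts_per_sublayer  # M * K
    if d_ff % total_experts != 0:
        raise ValueError("d_ff must be divisible by M * K for an even split")

    expert_width = d_ff // total_experts
    dense_params = matrices_per_ffn * d_model * d_ff
    expert_params = matrices_per_ffn * d_model * expert_width

    # layout[m][k] is the parameter count of expert k in sub-layer m.
    layout = [[expert_params] * experts_per_sublayer for _ in range(num_sublayers)]
    return {
        "expert_width": expert_width,
        "dense_params": dense_params,
        "partitioned_params": sum(sum(row) for row in layout),
        "layout": layout,
    }


# Hypothetical sizes: M = 2 sub-layers, K = 8 experts per sub-layer.
result = partition_ffn(d_model=1024, d_ff=4096, num_sublayers=2, experts_per_sublayer=8)
print(result["expert_width"], result["dense_params"], result["partitioned_params"])
```

Under these assumptions the expert weights sum to exactly the dense FFN's weights, so any difference from the dense parameter count would come only from components not modeled here, such as the router.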