Magnitude Pruning of Large Pretrained Transformer Models with a Mixture Gaussian Prior

Mingxuan Zhang, Yan Sun, Faming Liang

TL;DR

Extensive evaluations across various NLP tasks, including natural language understanding, question answering, and natural language generation, demonstrate the superiority of MGPP over existing pruning methods, particularly in high sparsity settings.

Abstract

Large pretrained transformer models have revolutionized modern AI applications with their state-of-the-art performance in natural language processing (NLP). However, their substantial parameter count poses challenges for real-world deployment. To address this, researchers often reduce model size by pruning parameters based on their magnitude or sensitivity. Previous research has demonstrated the limitations of magnitude pruning, especially in the context of transfer learning for modern NLP tasks. In this paper, we introduce a new magnitude-based pruning algorithm called mixture Gaussian prior pruning (MGPP), which employs a mixture Gaussian prior for regularization. MGPP prunes non-expressive weights under the guidance of the mixture Gaussian prior, aiming to retain the model's expressive capability. Extensive evaluations across various NLP tasks, including natural language understanding, question answering, and natural language generation, demonstrate the superiority of MGPP over existing pruning methods, particularly in high sparsity settings. Additionally, we provide a theoretical justification for the consistency of the sparse transformer, shedding light on the effectiveness of the proposed pruning method.

Paper Structure

This paper contains 32 sections, 16 equations, 3 figures, 13 tables, 2 algorithms.

Figures (3)

  • Figure 1: Visualization of the penalty functions for various regularization methods. The MGP curve shown corresponds to $-\log(\pi(\theta;\lambda=1\times10^{-6}, \sigma_0^2=1\times10^{-7}, \sigma_1^2=0.1))$, with a zoomed-in view of the region near zero. Unlike $L_0$ regularization, which is not differentiable and requires gradient estimators (Louizos et al., 2017), the MGP is differentiable across the entire parameter space.
  • Figure 2: A comparative analysis of MGPP and $L_2$: (a) distributions of remaining nonzero parameters, (b) magnitude pruning thresholds during training.
  • Figure A1: How $\lambda$, $\sigma_0^2$, and $\sigma_1^2$ change the landscape of the MGP.
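
The penalty plotted in Figure 1 can be evaluated numerically. The sketch below is illustrative only: it assumes the prior has the common two-component form $\pi(\theta) = \lambda\,N(\theta; 0, \sigma_1^2) + (1-\lambda)\,N(\theta; 0, \sigma_0^2)$, which this page does not state explicitly, and the function name `mgp_neg_log_density` is hypothetical. The default hyperparameters are the ones quoted in the Figure 1 caption.

```python
import math


def mgp_neg_log_density(theta, lam=1e-6, sigma0_sq=1e-7, sigma1_sq=0.1):
    """Return -log pi(theta) for an ASSUMED two-component mixture Gaussian prior.

    Assumed form (not confirmed by the page):
        pi(theta) = lam * N(theta; 0, sigma1_sq) + (1 - lam) * N(theta; 0, sigma0_sq)
    """

    def log_normal(x, var):
        # Log density of a zero-mean Gaussian with variance `var`.
        return -0.5 * math.log(2.0 * math.pi * var) - x * x / (2.0 * var)

    # Log of each weighted component.
    log_wide = math.log(lam) + log_normal(theta, sigma1_sq)
    log_narrow = math.log1p(-lam) + log_normal(theta, sigma0_sq)

    # Log-sum-exp keeps the result finite when one component underflows.
    m = max(log_wide, log_narrow)
    log_density = m + math.log(math.exp(log_wide - m) + math.exp(log_narrow - m))
    return -log_density


if __name__ == "__main__":
    # Print the penalty at a few points to see the sharp well near zero.
    for theta in (0.0, 1e-4, 1e-3, 1e-2, 0.1, 1.0):
        print(f"theta={theta:g}  penalty={mgp_neg_log_density(theta):.4f}")
```

Under these assumptions the penalty is symmetric in $\theta$ and lowest at zero. It climbs steeply while the narrow component dominates and then grows slowly once the wide component takes over, which is consistent with the caption's remark that the MGP stays differentiable everywhere.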