Improved Methods for Model Pruning and Knowledge Distillation

Wei Jiang, Anying Fu, Youling Zhang

TL;DR

This work addresses the computational burden of large language models by introducing MAMA Pruning, a three-step pruning method that identifies and redistributes weights based on their magnitude and dynamic behavior, with post-training GRPO rewards guiding pruning decisions. It also integrates a knowledge distillation perspective within the GRPO framework to balance compression with knowledge retention. Preliminary experiments suggest MAMA can achieve competitive performance relative to state-of-the-art pruning methods across varying sparsity levels, though gradient- and activation-aware approaches may excel at higher pruning levels. The approach aims to enable faster content generation with minimal degradation, and the authors outline extensive future directions, including broader teacher models and semantic pruning via a wisdom graph.

Abstract

Model pruning is a performance optimization technique for large language models such as R1 or o3-mini. It aims to identify and remove neurons and connections that are unlikely to contribute during the human-computer interaction phase. However, existing pruning methods often cause significant performance degradation or require extensive retraining and fine-tuning. Our goal is to obtain a much smaller and faster knowledge-distilled model that can quickly generate content almost as good as that of the unpruned model. We propose MAMA Pruning, short for Movement and Magnitude Analysis, an improved pruning method that reduces model size and computational complexity while maintaining performance comparable to the original unpruned model, even at extreme pruning levels. The method uses the weights and biases fixed in the pre-training phase, together with GRPO rewards verified during the post-training phase, as its novel pruning indicators. Preliminary experimental results show that our method outperforms or is comparable to state-of-the-art methods across various pruning levels and different downstream computational linguistics tasks.
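To make the idea of combining magnitude and movement signals concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a hypothetical scoring rule (final magnitude plus a weighted change in magnitude between two checkpoints) and unstructured pruning of a flat weight list. The function names, the `alpha` parameter, and the scoring formula are all assumptions introduced here, and the GRPO reward signal described in the abstract is not modeled.

```python
def combined_scores(w_before, w_after, alpha=1.0):
    """Score each weight by its magnitude plus its movement away from zero.

    Hypothetical rule for illustration only: the abstract does not specify
    how MAMA combines its indicators.
    """
    scores = []
    for b, a in zip(w_before, w_after):
        magnitude = abs(a)
        movement = abs(a) - abs(b)  # positive if the weight grew in magnitude
        scores.append(magnitude + alpha * movement)
    return scores


def prune_mask(scores, sparsity):
    """Return a 0/1 mask that drops the lowest-scoring fraction of weights."""
    n_prune = int(len(scores) * sparsity)
    # Indices in ascending score order; the first n_prune are pruned.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scores))]


def apply_mask(weights, mask):
    """Zero out the pruned weights."""
    return [w * m for w, m in zip(weights, mask)]


# Toy example: weights before and after a fine-tuning phase.
w_before = [0.5, -0.1, 0.8, 0.05]
w_after = [0.4, -0.3, 0.9, 0.01]

mask = prune_mask(combined_scores(w_before, w_after), sparsity=0.5)
pruned_weights = apply_mask(w_after, mask)
```

In this toy example, setting `alpha=0` reduces the rule to plain magnitude pruning, which would drop the second weight (magnitude 0.3). With the movement term included, that weight is kept because it moved away from zero, and the first weight, which shrank, is dropped instead.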

Paper Structure

This paper contains 10 sections, 2 tables.