Compression of Neural Machine Translation Models via Pruning

Abigail See, Minh-Thang Luong, Christopher D. Manning

TL;DR

This work demonstrates that simple magnitude-based weight pruning, paired with retraining, can dramatically compress LSTM-based NMT models with minimal or no loss in translation quality. Of the three pruning schemes compared, class-blind pruning performs best across pruning levels, allowing up to 80% of the weights to be pruned while retraining recovers or surpasses the baseline BLEU scores. The study reports a storage reduction from 782 MB to 272 MB (65.2% smaller) and shows where redundancy lies: higher layers and the attention and softmax weights are the most important, while embedding weights for rare words are highly redundant. The findings generalize to smaller multilingual NMT systems and offer practical guidance for deploying compressed NMT models and understanding redundancy in neural architectures.

Abstract

Neural Machine Translation (NMT), like many other deep learning domains, typically suffers from over-parameterization, resulting in large storage sizes. This paper examines three simple magnitude-based pruning schemes to compress NMT models, namely class-blind, class-uniform, and class-distribution, which differ in terms of how pruning thresholds are computed for the different classes of weights in the NMT architecture. We demonstrate the efficacy of weight pruning as a compression technique for a state-of-the-art NMT system. We show that an NMT model with over 200 million parameters can be pruned by 40% with very little performance loss as measured on the WMT'14 English-German translation task. This sheds light on the distribution of redundancy in the NMT architecture. Our main result is that with retraining, we can recover and even surpass the original performance with an 80%-pruned model.
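
To make the difference between the schemes concrete, here is a minimal sketch of two of them on toy data. It assumes class-blind pruning ranks all weights in the model together and removes the smallest-magnitude fraction, while class-uniform pruning removes the same fraction separately within each class. The function names, the flat-list weight representation, and the class labels `source_embedding` and `softmax` are hypothetical, chosen only for illustration; class-distribution pruning is omitted.

```python
from typing import Dict, List

# Hypothetical representation: each weight class is a flat list of floats.
Weights = Dict[str, List[float]]


def prune_class_blind(classes: Weights, fraction: float) -> Weights:
    """Zero the smallest-magnitude `fraction` of ALL weights, ignoring class."""
    # Rank every weight in the model together by absolute value.
    ranked = sorted(
        (abs(w), name, i)
        for name, ws in classes.items()
        for i, w in enumerate(ws)
    )
    n_prune = int(fraction * len(ranked))
    pruned = {name: list(ws) for name, ws in classes.items()}
    for _, name, i in ranked[:n_prune]:
        pruned[name][i] = 0.0
    return pruned


def prune_class_uniform(classes: Weights, fraction: float) -> Weights:
    """Zero the smallest-magnitude `fraction` of weights within EACH class."""
    pruned = {}
    for name, ws in classes.items():
        # Indices of this class's weights, smallest magnitude first.
        order = sorted(range(len(ws)), key=lambda i: abs(ws[i]))
        drop = set(order[: int(fraction * len(ws))])
        pruned[name] = [0.0 if i in drop else w for i, w in enumerate(ws)]
    return pruned


# Toy model: one class with small weights, one with large weights.
toy = {
    "source_embedding": [0.01, -0.02, 0.03, 0.04],
    "softmax": [0.5, -0.6, 0.7, -0.8],
}
blind = prune_class_blind(toy, 0.5)
uniform = prune_class_uniform(toy, 0.5)
print(blind)
print(uniform)
```

In this toy case, class-blind pruning removes every weight in the small-magnitude class and leaves the other class untouched, whereas class-uniform pruning removes half of each class. The example only illustrates how the thresholds differ; it says nothing about which scheme yields better translation quality.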

Paper Structure

This paper contains 17 sections, 2 equations, 8 figures.

Figures (8)

  • Figure 1: A simplified diagram of NMT.
  • Figure 2: NMT architecture. This example has two layers, but our system has four. The different weight classes are indicated by arrows of different color (the black arrows in the top right represent simply choosing the highest-scoring word, and thus require no parameters). Best viewed in color.
  • Figure 3: Effects of different pruning schemes.
  • Figure 4: 'Breakdown' of performance loss (i.e., perplexity increase) by weight class, when pruning 90% of weights using each of the three pruning schemes. Each of the first eight classes has 8 million weights, attention has 2 million, and the last three have 50 million weights each.
  • Figure 5: Magnitude of largest deleted weight vs. perplexity change, for the 12 different weight classes when pruning 90% of parameters by class-uniform pruning.
  • ...and 3 more figures