Merging Feed-Forward Sublayers for Compressed Transformers

Neha Verma, Kenton Murray, Kevin Duh

TL;DR

Large Transformer models demand compression for deployment. The paper introduces a post-training method to merge and tie adjacent feed-forward (FF) sublayers via permutation-based alignment, reducing parameters while retaining performance across GPT-2, ViT, and OPUS-MT. Key contributions include showing that more than a third of FF sublayers can be merged with minimal loss, demonstrating activation similarity among FF blocks, and providing an extensible toolkit that scales with quantization and QLoRA. This approach offers a practical, hardware-friendly path to smaller, memory-efficient Transformers with preserved accuracy.

Abstract

With the rise and ubiquity of larger deep learning models, the need for high-quality compression techniques is growing in order to deploy these models widely. The sheer parameter count of these models makes it difficult to fit them into the memory constraints of different hardware. In this work, we present a novel approach to model compression by merging similar parameter groups within a model, rather than pruning away less important parameters. Specifically, we select, align, and merge separate feed-forward sublayers in Transformer models, and test our method on language modeling, image classification, and machine translation. With our method, we demonstrate performance comparable to the original models while combining more than a third of model feed-forward sublayers, and demonstrate improved performance over a strong layer-pruning baseline. For instance, we can remove over 21% of total parameters from a Vision Transformer, while maintaining 99% of its original performance. Additionally, we observe that some groups of feed-forward sublayers exhibit high activation similarity, which may help explain their surprising mergeability.
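The align-then-merge idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation, and it makes several assumptions that are not stated in the text above: a ReLU activation, greedy matching of hidden units by activation correlation (a simplification standing in for an optimal assignment solver), simple averaging as the merge operation, and a toy second sublayer that is just a shuffled copy of the first. All function names are hypothetical.

```python
import numpy as np

def ff_hidden(x, w_in, b_in):
    """Hidden activations of a feed-forward sublayer (ReLU is an assumption)."""
    return np.maximum(x @ w_in + b_in, 0.0)

def ff_forward(x, w_in, b_in, w_out, b_out):
    """Full feed-forward sublayer: ReLU(x W_in + b_in) W_out + b_out."""
    return ff_hidden(x, w_in, b_in) @ w_out + b_out

def greedy_match(h_ref, h_other):
    """Pair hidden units by activation correlation.

    Returns perm such that unit perm[i] of the other sublayer is paired with
    unit i of the reference. Greedy matching is a simplification; an optimal
    assignment solver could be substituted.
    """
    d = h_ref.shape[1]
    # corr[i, j] = correlation between reference unit i and other unit j
    corr = np.corrcoef(h_ref.T, h_other.T)[:d, d:]
    corr = np.nan_to_num(corr, nan=-1.0)  # guard against constant (dead) units
    perm = np.full(d, -1, dtype=int)
    taken = np.zeros(d, dtype=bool)
    for i in np.argsort(-corr.max(axis=1)):  # most confident units first
        scores = np.where(taken, -np.inf, corr[i])
        j = int(np.argmax(scores))
        perm[i], taken[j] = j, True
    return perm

def permute_ff(params, perm):
    """Reorder hidden units; the function computed by the sublayer is unchanged."""
    w_in, b_in, w_out, b_out = params
    return w_in[:, perm], b_in[perm], w_out[perm, :], b_out

def merge_ff(params_a, params_b):
    """Average two sublayers; the result would then be shared (tied) by both."""
    return tuple(0.5 * (a + b) for a, b in zip(params_a, params_b))

# Toy setup: sublayer B computes the same function as A with shuffled hidden units.
rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 16, 256
x = rng.normal(size=(n, d_model))
layer_a = (rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff),
           rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model))
secret = rng.permutation(d_ff)
layer_b = permute_ff(layer_a, secret)

perm = greedy_match(ff_hidden(x, *layer_a[:2]), ff_hidden(x, *layer_b[:2]))
merged = merge_ff(layer_a, permute_ff(layer_b, perm))  # align, then merge
vanilla = merge_ff(layer_a, layer_b)                   # merge without alignment

err_aligned = np.abs(ff_forward(x, *merged) - ff_forward(x, *layer_a)).max()
err_vanilla = np.abs(ff_forward(x, *vanilla) - ff_forward(x, *layer_a)).max()
print(f"aligned merge error: {err_aligned:.2e}, vanilla merge error: {err_vanilla:.2e}")
```

In this idealized case the aligned merge reproduces the original sublayer, while averaging without alignment does not. Real sublayers are not exact shuffled copies of one another, so alignment can only reduce the merging error rather than eliminate it.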

Paper Structure

This paper contains 36 sections, 5 equations, 6 figures, 11 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Overview of the feed-forward alignment and merging algorithm used to compress models, shown for an example with three Transformer layers. Multi-headed attention is abbreviated to MHA, feed-forward sublayers are depicted with $W^{\text{in}}$ and $W^{\text{out}}$ weights, and Add&Norm operations are depicted with $\bigoplus$, connected by arrows indicating residual connections. Permutation transformation matrices are shown as $P_i$. Our method consists of finding the permutations, applying the transformations, merging the transformed parameters, and finally tying the merged parameters. By merging and tying $k$ feed-forward sublayers, we can reduce the model size by $k-1$ feed-forward sublayers.
  • Figure 2: Compression versus performance across all three tasks. We include results from our main method, labeled Permute FF Merge, as well as our method without permutation alignment, labeled Vanilla FF Merge. Across all tasks, our method retains almost complete performance with one-third of feed-forward sublayers removed, and continues to retain high performance with one-half of FF sublayers removed.
  • Figure 3: Compression versus performance across all three tasks for our method and a strong layer-dropping baseline. For the baseline, we drop 1/6 and 1/3 of layers and, among all sliding windows of dropped layers, fine-tune the set that performs best before tuning. Across the parameter reduction range shown, our merging-based compression method outperforms or matches layer dropping on all three tasks.
  • Figure 4: Performance curves over different ranges of merged feed-forward sublayers, each corresponding to 1/3 of FF sublayers removed. Across all three tasks, some ranges of sublayers clearly retain more performance than others when merged.
  • Figure 5: CKA plots of feed-forward sublayer hidden states across three different models. In all three settings, we see clear regions of high similarity between different FF layers.
  • ...and 1 more figure
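The CKA similarity plots mentioned for Figure 5 can be approximated with the short sketch below. It assumes the linear variant of CKA computed on activation matrices of shape (samples, features); the text above does not say which variant the authors use, and the function names and toy data are hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def cka_matrix(acts):
    """Pairwise linear CKA for a list of per-layer activation matrices."""
    n = len(acts)
    out = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = linear_cka(acts[i], acts[j])
    return out

# Toy activations standing in for FF hidden states collected on the same inputs.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 16))
shuffled = 3.0 * base[:, rng.permutation(16)]  # same features, permuted and rescaled
unrelated = rng.normal(size=(500, 16))

sim = cka_matrix([base, shuffled, unrelated])
print(np.round(sim, 2))
```

Linear CKA is invariant to permuting and uniformly rescaling features, so two sublayers can score as highly similar even when their hidden units are ordered differently. That is consistent with a permutation alignment step being needed before parameters are averaged.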