Strategic Fusion Optimizes Transformer Compression

Md Shoaibur Rahman

TL;DR

This work tackles transformer compression by examining a broad set of pruning signals and introducing data-driven fusion to guide layer removal. It models per-layer impact with 12 signals drawn from activations, mutual information, gradients, weights, and attention, and integrates them via linear regression or a random forest to identify pruning sequences that preserve accuracy while reducing model size. Knowledge distillation further improves the compressed models: the distilled random-forest model surpasses the original accuracy on six of the nine datasets and yields large gains in the accuracy-to-size ratio. The findings indicate that a fusion-based, data-driven pruning framework, especially with random-forest fusion, offers practical, scalable benefits for deploying transformers in resource-constrained settings. They also highlight the importance of pruning order and the use of biological analogies for understanding pruning dynamics.

Abstract

This study investigates transformer model compression by systematically pruning its layers. We evaluated 14 pruning strategies across nine diverse datasets, including 12 strategies based on different signals obtained from layer activations, mutual information, gradients, weights, and attention. To address the limitations of single-signal strategies, we introduced two fusion strategies, linear regression and random forest, which combine the individual strategies (strategic fusion) for more informed pruning decisions. Additionally, we applied knowledge distillation to mitigate accuracy loss caused by layer pruning. Our results reveal that random forest strategic fusion outperforms individual strategies on seven of the nine datasets and achieves near-optimal performance on the other two. The distilled random forest model surpasses the original accuracy on six datasets and mitigates accuracy drops on the remaining three. Knowledge distillation also improves the accuracy-to-size ratio by an average factor of 18.84 across all datasets. Supported by mathematical foundations and biological analogies, our findings suggest that strategically combining multiple signals can lead to efficient, high-performing transformer models for resource-constrained applications.
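
As a rough illustration of the linear-regression variant of strategic fusion, the sketch below fits least-squares weights that map per-layer signal values to a measured accuracy drop, then orders layers from least to most harmful to remove. The regression target, the one-shot ordering, the toy two-signal data, and the names `solve_linear_system`, `fit_linear_fusion`, and `pruning_order` are all assumptions made for illustration; the abstract does not specify them, and the random-forest variant would swap in a different regressor.

```python
from typing import List, Sequence


def solve_linear_system(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_linear_fusion(signals: Sequence[Sequence[float]],
                      accuracy_drops: Sequence[float]) -> List[float]:
    """Fit [bias, w_1, ..., w_k] via the normal equations (X^T X) w = X^T y."""
    X = [[1.0] + list(row) for row in signals]  # prepend a bias column
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * y for r, y in zip(X, accuracy_drops)) for i in range(k)]
    return solve_linear_system(A, b)


def pruning_order(signals: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[int]:
    """Return layer indices sorted by predicted accuracy drop, smallest first."""
    def predicted_drop(layer: int) -> float:
        return weights[0] + sum(w * s for w, s in zip(weights[1:], signals[layer]))
    return sorted(range(len(signals)), key=predicted_drop)


# Toy data (hypothetical): four layers, two signals per layer.
toy_signals = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_drops = [0.5, 2.5, 1.5, 3.5]  # assumed measured accuracy drops

fusion_weights = fit_linear_fusion(toy_signals, toy_drops)
layer_order = pruning_order(toy_signals, fusion_weights)  # layers to prune first
```

With 12 signals instead of two, each row of `signals` would simply have 12 entries; the fit needs at least as many linearly independent observations as there are weights.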

Paper Structure (19 sections, 27 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Comparison of accuracies across nine datasets as an increasing number of transformer layers are pruned. Each colored line corresponds to a distinct pruning strategy, with the dashed line indicating the unpruned baseline accuracy. The plots highlight how different pruning criteria affect model performance at varying compression levels.
  • Figure 2: Ranking of the strategies across datasets. Bars represent the mean rank of each strategy, while dots indicate its rank on individual datasets. Ranks are computed by sorting the strategies for each dataset by their maximum accuracy, with the highest accuracy assigned rank 15 (there are 15 strategies in total, including random), the second highest rank 14, and so on.
  • Figure 3: Percentage change in maximum accuracy compared to the baseline for each strategy. Black asterisks indicate that the means are significantly less than zero. Red asterisks indicate that the means are significantly higher than the mean of the random strategy. Green asterisks indicate that the means are not significantly different from zero. All tests are Wilcoxon signed-rank tests, $p < 0.05$.
  • Figure 4: Maximum accuracy-to-size ratio for each strategy, averaged across all datasets. Error bars represent the standard error of the mean. Red asterisks indicate strategies whose ratio differs significantly from that of the random strategy (Wilcoxon signed-rank test, $p < 0.05$).
  • Figure 5: Accuracy comparison between the original model, the compressed model (random forest strategic fusion), and the compressed model with knowledge distillation. Distillation surpasses the original accuracy on six datasets and mitigates accuracy drops on the remaining three.
  • ...and 3 more figures
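
The rank computation described in the Figure 2 caption can be sketched as follows: for each dataset, strategies are sorted by maximum accuracy so that the best one receives the highest rank (15 when there are 15 strategies), and ranks are then averaged across datasets. The function name `mean_ranks`, the placeholder strategy and dataset names, and the accuracy values are hypothetical. The caption does not say how ties are handled; this sketch breaks them by insertion order, which is an assumption.

```python
from typing import Dict


def mean_ranks(max_accuracy: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average per-dataset ranks, where the highest accuracy gets the highest rank.

    max_accuracy maps dataset -> {strategy: maximum accuracy}. Every dataset
    is assumed to list the same set of strategies.
    """
    rank_totals: Dict[str, int] = {}
    for scores in max_accuracy.values():
        # Ascending sort: lowest accuracy gets rank 1, highest gets rank N.
        ordered = sorted(scores, key=scores.get)
        for rank, strategy in enumerate(ordered, start=1):
            rank_totals[strategy] = rank_totals.get(strategy, 0) + rank
    n_datasets = len(max_accuracy)
    return {s: total / n_datasets for s, total in rank_totals.items()}


# Made-up numbers for three placeholder strategies on two placeholder datasets.
example = {
    "dataset_1": {"strategy_a": 0.9, "strategy_b": 0.8, "strategy_c": 0.7},
    "dataset_2": {"strategy_a": 0.6, "strategy_b": 0.9, "strategy_c": 0.7},
}
example_ranks = mean_ranks(example)
```

In this toy case `strategy_b` ends up with the highest mean rank, since it is second on one dataset and first on the other.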