Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction

Anantha Padmanaban Krishna Kumar

TL;DR

Vision Transformers can be overparameterized in their MLP blocks. The authors propose two parameter-efficient variants for ViT-B/16 on ImageNet-1K: GroupedMLP, which shares MLPs across adjacent blocks, and ShallowMLP, which halves the MLP width. Both reduce parameters by 32.7% yet improve top-1 accuracy (81.47% and 81.25%) over the baseline (81.05%), and both train more stably, cutting peak-to-final accuracy degradation from 0.47% to between 0.03% and 0.06%. The results indicate that constraining MLP capacity via sharing or width reduction can serve as a useful inductive bias, suggesting a broader role for architectural constraints in transformer design and optimization.

Abstract

Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7% of the baseline parameters. Our GroupedMLP variant shares MLP weights between adjacent transformer blocks and achieves 81.47% top-1 accuracy while maintaining the baseline computational cost. Our ShallowMLP variant halves the MLP hidden dimension and reaches 81.25% top-1 accuracy with a 38% increase in inference throughput. Both models outperform the 86.6M-parameter baseline (81.05%) and exhibit substantially improved training stability, reducing peak-to-final accuracy degradation from 0.47% to the range 0.03% to 0.06%. These results suggest that, for ViT-B/16 on ImageNet-1K with a standard training recipe, the model operates in an overparameterized regime in which MLP capacity can be reduced without harming performance and can even slightly improve it. More broadly, our findings suggest that architectural constraints such as parameter sharing and reduced width may act as useful inductive biases, and highlight the importance of how parameters are allocated when designing Vision Transformers. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/parameter-efficient-vit-mlps.
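
The 32.7% figure quoted for both variants can be sanity-checked with simple parameter arithmetic. The sketch below is not taken from the authors' repository. It assumes the standard ViT-B/16 configuration (12 blocks, embedding width 768, MLP hidden width 3072), which the text above does not state, and it assumes GroupedMLP pairs adjacent blocks so that 6 MLPs serve 12 blocks. The 86.6M baseline total comes from the abstract; all variable and function names are illustrative.

```python
# Standard ViT-B/16 dimensions (assumed; not stated in the text above).
EMBED_DIM = 768
MLP_HIDDEN = 3072
NUM_BLOCKS = 12
BASELINE_PARAMS = 86.6e6  # total baseline parameters, from the abstract


def mlp_params(embed_dim: int, hidden_dim: int) -> int:
    """Weights and biases of one two-layer MLP: embed -> hidden -> embed."""
    fc1 = embed_dim * hidden_dim + hidden_dim
    fc2 = hidden_dim * embed_dim + embed_dim
    return fc1 + fc2


full_mlp = mlp_params(EMBED_DIM, MLP_HIDDEN)
baseline_mlp_total = NUM_BLOCKS * full_mlp

# GroupedMLP (assumed: blocks are paired, so 6 MLPs serve 12 blocks).
grouped_removed = baseline_mlp_total - (NUM_BLOCKS // 2) * full_mlp

# ShallowMLP: every block keeps its own MLP, at half the hidden width.
shallow_removed = baseline_mlp_total - NUM_BLOCKS * mlp_params(EMBED_DIM, MLP_HIDDEN // 2)

grouped_pct = 100 * grouped_removed / BASELINE_PARAMS
shallow_pct = 100 * shallow_removed / BASELINE_PARAMS
print(f"GroupedMLP removes {grouped_removed:,} parameters ({grouped_pct:.1f}%)")
print(f"ShallowMLP removes {shallow_removed:,} parameters ({shallow_pct:.1f}%)")
```

Under these assumptions both variants remove about 28.3M parameters, consistent with the reported 32.7%. The two counts differ by 4,608 parameters because ShallowMLP keeps the output-layer bias in all 12 blocks, while pairwise sharing removes six of them.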

Paper Structure

This paper contains 12 sections, 2 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Validation accuracy and loss from epoch 150 to 300. Solid lines show the mean across two seeds; shaded regions indicate the range. The baseline exhibits declining accuracy and rising loss after approximately epoch 220, while both parameter-reduced models maintain stable performance through training completion.
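
The stability comparison rests on peak-to-final accuracy degradation. A minimal sketch of one plausible reading of that metric follows: the best validation accuracy seen during training minus the accuracy at the last epoch. The exact definition, and how it is aggregated across the two seeds, is an assumption here. The function name is hypothetical and the sample curve is a toy series, not data from the paper.

```python
def peak_to_final_degradation(val_acc: list[float]) -> float:
    """Best validation accuracy over training minus the last-epoch accuracy."""
    if not val_acc:
        raise ValueError("val_acc must contain at least one epoch")
    return max(val_acc) - val_acc[-1]


# Toy per-epoch accuracies (illustrative only; not taken from the paper).
toy_curve = [80.0, 81.5, 81.25, 81.0]
print(peak_to_final_degradation(toy_curve))  # 0.5
```

A value near zero means the final checkpoint is close to the best one. A larger value, such as the 0.47% reported for the baseline, means accuracy peaked earlier and declined before training ended.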