Few Dimensions are Enough: Fine-tuning BERT with Selected Dimensions Revealed Its Redundant Nature
Shion Fukuhata, Yoshinobu Kano
TL;DR
This paper demonstrates that BERT, when fine-tuned on GLUE tasks, exhibits substantial redundancy across token vectors, dimensions, and transformer layers. By evaluating token-wise, dimension-wise, and layer-wise variations, the authors show that the information needed for many tasks is not confined to the CLS vector or to the full final-layer representation: two to three dimensions often suffice, and hidden-layer representations can substitute for the final layer under certain conditions. Freezing pre-trained layers generally reduces performance, while cross-task fine-tuning reveals limited catastrophic forgetting, which the authors attribute to redundancy and which suggests robustness across multiple tasks. These findings point to strong opportunities for model pruning and efficient inference without large losses in task performance. The work contributes practical insights into dimensionality reduction, subnetwork selection, and cross-task transfer in large pre-trained transformers.
Abstract
When fine-tuning BERT models for specific tasks, it is common to select part of the final layer's output and feed it into a newly created fully connected layer. However, it remains unclear which part of the final layer should be selected and what information each dimension of the layers holds. In this study, we comprehensively investigated the effectiveness and redundancy of token vectors, layers, and dimensions through BERT fine-tuning on GLUE tasks. The results showed that outputs other than the CLS vector in the final layer contain equivalent information, that most tasks require only 2-3 dimensions, and that, while the contribution of lower layers decreases, there is little difference among higher layers. We also evaluated the impact of freezing pre-trained layers and conducted cross-fine-tuning, where fine-tuning is applied sequentially to different tasks. The findings suggest that hidden layers may change significantly during fine-tuning, that BERT has considerable redundancy that enables it to handle multiple tasks simultaneously, and that its number of dimensions may be excessive.
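
The setup described in the abstract (choosing one token vector from a layer's output, keeping only a few of its dimensions, and passing them to a new fully connected layer) can be illustrated with a minimal sketch. This is not the authors' implementation. The hidden size of 768, the chosen dimensions `[0, 1, 2]`, the helper names `select_features` and `linear_head`, and the random stand-in activations are all assumptions made for illustration.

```python
import random

# Assumed hidden size (that of BERT-base); the text above does not state the model size.
HIDDEN_SIZE = 768


def select_features(layer_output, token_index, dims):
    """Return the chosen dimensions of one token vector.

    layer_output is a list of token vectors with shape (seq_len, hidden_size).
    """
    token_vector = layer_output[token_index]
    return [token_vector[d] for d in dims]


def linear_head(features, weights, bias):
    """Newly created fully connected layer: one logit per class."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


rng = random.Random(0)
seq_len = 8

# Stand-in for a layer's output: random values, not real BERT activations.
final_layer = [
    [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)] for _ in range(seq_len)
]

dims = [0, 1, 2]  # hypothetical choice of three dimensions
# Index 0 is the CLS position; any other token index can be used instead.
features = select_features(final_layer, token_index=0, dims=dims)

num_classes = 2
# Untrained head weights; in fine-tuning these would be learned.
weights = [[rng.gauss(0.0, 0.02) for _ in dims] for _ in range(num_classes)]
bias = [0.0] * num_classes
logits = linear_head(features, weights, bias)
```

Changing `token_index`, `dims`, or the layer passed as `layer_output` corresponds to the token-wise, dimension-wise, and layer-wise variations, respectively.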
