Fisher Mask Nodes for Language Model Merging
Thennal D K, Ganesh Nathan, Suchithra M S
TL;DR
The paper addresses merging multiple task-specific fine-tuned Transformer models into a single multi-task model without relying on validation data. It introduces Fisher Mask Nodes: masks are inserted at attention heads and FFN layers, and the diagonal Fisher information of those masks is used to weight each model's parameters in the average. The weights $F_j$ are derived from the mask diagonals $I_{ii}$ and mapped to parameter blocks via $m_{mha}$ and $m_{mlp}$. The merge rule is $\boldsymbol{\theta}^* = \frac{\sum_{j=1}^{M} \lambda_j F_j \boldsymbol{\theta}_j}{\sum_{j=1}^{M} \lambda_j F_j}$. Empirically, the method achieves accuracy improvements of up to +6.5 and speedups of 57.4× to 321.7× over full Fisher-weighted merging on GLUE tasks across BERT and RoBERTa variants. These results suggest that the approach offers scalable, resource-efficient multi-task merging and may adapt to future architectures.
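The merge rule above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `fisher_weighted_merge`, the `eps` stabilizer in the denominator, and the use of broadcasting to spread a per-block Fisher value over a parameter block are all assumptions made for illustration. How $F_j$ is computed from the mask nodes is not shown here; the sketch takes the Fisher values as given.

```python
import numpy as np

def fisher_weighted_merge(thetas, fishers, lambdas=None, eps=1e-8):
    """Element-wise Fisher-weighted average of M parameter arrays.

    thetas  : list of M arrays with identical shape (one per model).
    fishers : list of M arrays or scalars, each broadcastable to the
              shape of the matching theta (assumed non-negative).
    lambdas : optional list of M per-model scalars; defaults to 1.0.
    eps     : hypothetical stabilizer that avoids division by zero
              where every model has zero Fisher information.
    """
    if lambdas is None:
        lambdas = [1.0] * len(thetas)

    numerator = 0.0
    denominator = 0.0
    for theta, fisher, lam in zip(thetas, fishers, lambdas):
        theta = np.asarray(theta, dtype=np.float64)
        fisher = np.asarray(fisher, dtype=np.float64)
        # Spread a per-block Fisher value over every parameter in the block.
        weight = lam * np.broadcast_to(fisher, theta.shape)
        numerator = numerator + weight * theta
        denominator = denominator + weight

    return numerator / (denominator + eps)


# Two toy "models" with one two-parameter block each.
theta_a = np.array([1.0, 3.0])
theta_b = np.array([3.0, 1.0])

# Model B's block carries three times the Fisher weight of model A's.
merged = fisher_weighted_merge([theta_a, theta_b], [1.0, 3.0])
```

When every model has the same Fisher value, the weights cancel and the rule reduces to a plain parameter average, which is a quick sanity check on any implementation.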
Abstract
Fine-tuning pre-trained models provides significant advantages in downstream performance. The ubiquitous nature of pre-trained models such as BERT and its derivatives in natural language processing has also led to a proliferation of task-specific fine-tuned models. As these models typically perform only one task well, additional training or ensembling is required in multi-task scenarios. The growing field of model merging provides a solution, dealing with the challenge of combining multiple task-specific models into a single multi-task model. In this study, we introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning. Utilizing the Fisher information of mask nodes within the Transformer architecture, we devise a computationally efficient weighted-averaging scheme. Our method exhibits a consistent and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging at a fraction of the computational cost, with baseline performance improvements of up to +6.5 and a speedup between 57.4x and 321.7x across models. Our results demonstrate the potential of our method in current multi-task learning environments and suggest its scalability and adaptability to new model architectures and learning scenarios.
