Dataless Knowledge Fusion by Merging Weights of Language Models

Xisen Jin; Xiang Ren; Daniel Preotiuc-Pietro; Pengxiang Cheng

TL;DR

This work tackles fusing knowledge across separately fine-tuned language models whose training data cannot be shared. It introduces Regression Mean (RegMean), a closed-form, dataless merging method that uses inner product matrices of linear layer inputs to compute the weights of a single merged model, and extends the procedure from linear models to transformer architectures. Empirical results show that RegMean outperforms simple averaging, Fisher-weighted averaging, and ensembling in both in-domain and out-of-domain evaluations. It can preserve or sometimes improve over the individual models, which makes it a promising alternative to multi-task learning, and merging is more efficient than training a multi-task model. The paper evaluates RegMean across diverse domains, tasks, and model architectures, highlighting practical advantages for privacy-preserving, multi-domain knowledge fusion. It also discusses limitations and future directions, including privacy concerns around the computed statistics and scalability to heterogeneous architectures.
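The closed-form merge described above can be illustrated for a single linear layer. The following is a minimal NumPy sketch, not the authors' implementation: the function names `regmean_merge` and `prediction_gap`, the toy shapes, and the random data are all assumptions made for illustration. It assumes each model owner releases a weight matrix together with the inner product (Gram) matrix of that layer's inputs, and that the merged weight solves the resulting least-squares problem.

```python
import numpy as np

def regmean_merge(weights, grams):
    """Merge linear-layer weights W_i (d_in x d_out) given Gram matrices
    G_i = X_i^T X_i of each layer's inputs (d_in x d_in).

    Solves (sum_i G_i) W = sum_i G_i W_i, the normal equations of
    min_W sum_i ||X_i W - X_i W_i||^2. Names are hypothetical.
    """
    gram_sum = sum(grams)
    weighted_sum = sum(g @ w for g, w in zip(grams, weights))
    # solve() avoids forming an explicit inverse of the summed Gram matrix
    return np.linalg.solve(gram_sum, weighted_sum)

def prediction_gap(w, inputs, weights):
    """Total squared distance between merged and individual predictions."""
    return sum(np.linalg.norm(x @ w - x @ wi) ** 2
               for x, wi in zip(inputs, weights))

# Toy setup: two "private" input sets and two individually trained weights.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(50, 4))
X2 = rng.normal(size=(60, 4))
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 3))

# Only the Gram matrices (not X1, X2) would need to be shared for merging.
G1, G2 = X1.T @ X1, X2.T @ X2

W_merged = regmean_merge([W1, W2], [G1, G2])
W_simple = (W1 + W2) / 2  # simple averaging baseline

gap_merged = prediction_gap(W_merged, [X1, X2], [W1, W2])
gap_simple = prediction_gap(W_simple, [X1, X2], [W1, W2])
```

In this sketch the raw inputs are used only to build the Gram matrices and to check the result; the merge itself touches only the weights and the shared statistics, which is the sense in which the procedure is dataless.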

Abstract

Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that performs well across all data set domains and generalizes to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning that can preserve or sometimes improve over the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, thus making it applicable to a wider set of scenarios.
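For a single linear layer, the idea of choosing merged weights that "minimize prediction differences" can be written out as a short derivation. The notation here is an assumption introduced for illustration: $X_i$ denotes the inputs to the layer for model $i$, $W_i$ its weight matrix, and $W_M$ the merged weight. The derivation assumes the summed inner product matrix is invertible.

```latex
% Objective: match each individual model's predictions on its own inputs
\min_{W} \; \sum_{i=1}^{N} \left\| X_i W - X_i W_i \right\|^2

% Setting the gradient with respect to W to zero
\sum_{i=1}^{N} 2\, X_i^\top X_i \left( W - W_i \right) = 0

% Closed-form solution, using only the inner product matrices X_i^T X_i
W_M = \Bigl( \sum_{i=1}^{N} X_i^\top X_i \Bigr)^{-1} \sum_{i=1}^{N} X_i^\top X_i \, W_i
```

The solution depends on the data only through the matrices $X_i^\top X_i$, so the raw training examples never need to leave their owners.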

Paper Structure (31 sections, 8 equations, 8 figures, 8 tables, 1 algorithm)

Introduction
Dataless Model Merging for Knowledge Fusion
Regression Mean for Model Merging
Preliminaries
Fisher-Weighted Averaging (Fisher)
Merging Linear Models
RegMean for Transformer Language Models
Properties of RegMean
Experimental Setup
Evaluation Settings
Compared Methods
Experiment Details
Results
Model Merging for Fusing In-Domain Knowledge
Merging Models Trained on Non-i.i.d. Partitions.
...and 16 more sections

Figures (8)

Figure 1: Diagram of the problem formulation for model merging and its comparison to other setups, including multi-task learning, model ensembling, and federated learning. Models $f_{1..N}$ trained by individuals or organizations are released to the user (optionally with some statistics) but the training data $D_{1..N}$ is kept private.
Figure 2: Comparison between Simple, Fisher, and RegMean for merging transformer-based language models. Fisher and RegMean require the Fisher information matrix and the inner product matrices of layer inputs, respectively, but neither requires the training data itself. For linear models, RegMean produces optimal weights that minimize the $\ell^2$-distance to individual model predictions on the corresponding training sets.
Figure 3: Relative performance drop (%) of pairwise merged models compared to the domain-specific models. Positive values indicate performance improvement after merging. The boxplots summarize results over 10 ($\mathcal{C}_5^2$) or 15 ($\mathcal{C}_6^2$) combinations of 5 or 6 domain-specific models in Emotion and NER. The triangles denote the mean. Note that the y-axes are not on the same scale.
Figure 4: Relative performance drop (%) of merged models compared to task-specific models in our pairwise model merging experiments over GLUE.
Figure 5: Performance of RegMean with different values of $\alpha$ in Emotion Classification. $*$ denotes Simple Average.
...and 3 more figures

Theorems & Definitions (1)

proof
