Deep Model Fusion: A Survey

Weishi Li; Yong Peng; Miao Zhang; Liang Ding; Han Hu; Li Shen

TL;DR

This survey addresses deep model fusion, a framework for merging multiple deep networks to improve accuracy, robustness, and efficiency. It organizes approaches into four families (mode connectivity, alignment, weight averaging, and ensemble learning) and covers methods, theory, and applications, including federated learning, fine-tuning, distillation, and foundation-model fusion. The work highlights practical benefits, limitations, and key challenges such as computational cost, handling heterogeneity, and scalability. It also outlines future directions, including scalable alignment, subspace fusion, and adaptive fusion for large-scale, heterogeneous systems.

Abstract

Deep model fusion (or merging) is an emerging technique that merges the parameters or predictions of multiple deep learning models into a single model. It combines the abilities of different models to compensate for the biases and errors of any single model and thereby achieve better performance. However, deep model fusion on large-scale deep learning models (e.g., LLMs and foundation models) faces several challenges, including high computational cost, a high-dimensional parameter space, and interference between heterogeneous models. Although model fusion has attracted widespread attention for its potential to solve complex real-world tasks, a complete and detailed survey of the technique is still lacking. To better understand model fusion methods and promote their development, we present a comprehensive survey summarizing recent progress. Specifically, we group existing deep model fusion methods into four categories: (1) "Mode connectivity", which connects solutions in weight space via a path of non-increasing loss in order to obtain a better initialization for model fusion; (2) "Alignment", which matches units between neural networks to create better conditions for fusion; (3) "Weight average", a classical model fusion method, which averages the weights of multiple models to obtain more accurate results closer to the optimal solution; and (4) "Ensemble learning", which combines the outputs of diverse models and is a foundational technique for improving the accuracy and robustness of the final model. In addition, we analyze the challenges faced by deep model fusion and propose possible future research directions. Our review helps clarify the relationships between different model fusion methods and their practical applications, and can inform further research in the field of deep model fusion.
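To make three of the four categories concrete, the following is a minimal sketch on a toy one-hidden-layer ReLU network. The model layout, the helper names (`predict`, `average_weights`, `ensemble_predict`, `permute_hidden`), and all numbers are illustrative assumptions and do not come from the survey. It contrasts weight averaging (one fused set of parameters) with ensembling (averaged outputs), and shows why alignment can matter: two networks that compute the same function but order their hidden units differently average to a different function unless the units are matched first.

```python
# Toy network: y = sum_j v[j] * relu(w[j] * x + b[j]).
# A model is a tuple (w, b, v) of equal-length lists (hypothetical layout).

def predict(model, x):
    w, b, v = model
    return sum(vj * max(0.0, wj * x + bj) for wj, bj, vj in zip(w, b, v))

def average_weights(model_a, model_b):
    # Weight averaging: element-wise mean of the parameters, giving ONE fused model.
    return tuple(
        [(pa + pb) / 2.0 for pa, pb in zip(part_a, part_b)]
        for part_a, part_b in zip(model_a, model_b)
    )

def ensemble_predict(models, x):
    # Ensemble learning: keep every model and average their outputs.
    return sum(predict(m, x) for m in models) / len(models)

def permute_hidden(model, order):
    # Permutation symmetry: reordering hidden units does not change the function.
    w, b, v = model
    return ([w[i] for i in order], [b[i] for i in order], [v[i] for i in order])

model_a = ([1.0, -1.0], [0.0, 0.0], [1.0, 1.0])  # computes |x|
model_b = permute_hidden(model_a, [1, 0])        # same function, hidden units swapped

x = 2.0
naive = average_weights(model_a, model_b)        # averaged without matching units
aligned = average_weights(model_a, permute_hidden(model_b, [1, 0]))  # units matched first

print(predict(model_a, x), predict(model_b, x))  # both models agree on the input
print(predict(naive, x))                         # naive averaging cancels the weights
print(predict(aligned, x))                       # averaging after alignment keeps the function
print(ensemble_predict([model_a, model_b], x))   # the ensemble is unaffected by unit order
```

In this sketch the matching permutation is known by construction; in practice it has to be estimated, which is the problem the alignment methods in the survey address.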

Paper Structure (28 sections, 45 equations, 8 figures, 9 tables)


Introduction
Mode Connectivity
Linear Mode Connectivity
Non-linear Mode Connectivity
Mode Connectivity in Subspace
Discussion
Alignment
Re-basin
Activation Matching
Weight Matching
Discussion
Weight Average
Weight Average
SWA
Model Soup
...and 13 more sections

Figures (8)

Figure 1: Schematic diagram of the overall model fusion process, together with the classification of the fusion methods and the connections between them.
Figure 2: Schematic diagram of mode connectivity in a two-dimensional loss landscape and in subspaces of other dimensions. Left: Linear interpolation between the minima in the two basins results in high-loss barriers [draxler2018essentially]. The lower two optima are connected by a path of nearly constant low loss (e.g., a Bezier curve or a polygonal chain) [garipov2018loss]. $\pi(W_{2})$ is the model equivalent to $W_2$ under permutation symmetry, and it is located in the same basin as $W_1$. Re-Basin merges models by delivering solutions to individual basins [ainsworth2022git]. Right: Low-loss paths connect multiple minima in a subspace (e.g., a low-loss manifold composed of $d$-dim wedges [fort2019large]).
Figure 3: Left: general alignment process. Model $A$ is transformed into model $A_{p}$ with reference to model $B$. A linear combination of $A_{p}$ and $B$ then produces $C$. Right: the parameter vectors $\vartheta_{m}$, $\vartheta_{n}$ of two neurons in different hidden layers are adjusted until they are close to the replacement point [brea2019weight]. At the replacement point, $\vartheta_{m}^{\prime}=\vartheta_{n}^{\prime}$ and the two neurons compute the same function, which means that the two neurons can be exchanged.
Figure 4: Comparison of the sampling and learning rate schedules of different SWA-related methods. (a) SWA: constant learning rates. (b) SWA: cyclical learning rates $\textbf{c}$. (c) SWAD: dense sampling. (d) HWA: leverages both online and offline WA, sampled at different synchronization cycles with a sliding window of length $h$, i.e., $\overline{\overline{w_{i}}}=\frac{\sum_{t=i-h+1}^{i} \overline{w_{t}}}{h}$.
Figure 5: Flow chart of Task Arithmetic and LoRA Hub [huang2023lorahub] in multi-task scenarios.
...and 3 more figures
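
The sliding-window average in the Figure 4 caption, $\overline{\overline{w_{i}}}=\frac{\sum_{t=i-h+1}^{i} \overline{w_{t}}}{h}$, can be sketched in a few lines. This is only an illustration of that one formula, not an implementation of HWA: the function name `sliding_window_average`, the 0-based indexing, and the sample values are assumptions, and each $\overline{w_{t}}$ is taken to be an already-averaged weight vector stored as a plain list.

```python
def sliding_window_average(averaged_weights, i, h):
    """Mean of the h most recent averaged weight vectors ending at index i.

    averaged_weights[t] plays the role of \\bar{w}_t (0-based index, an assumption).
    """
    if h < 1 or i - h + 1 < 0 or i >= len(averaged_weights):
        raise ValueError("window [i-h+1, i] must lie inside the stored history")
    window = averaged_weights[i - h + 1 : i + 1]
    dim = len(window[0])
    # Average each coordinate over the window.
    return [sum(w[k] for w in window) / h for k in range(dim)]

# Hypothetical history of four averaged weight vectors with two parameters each.
history = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [6.0, 12.0]]

short_window = sliding_window_average(history, i=3, h=2)  # uses the last two entries
full_window = sliding_window_average(history, i=3, h=4)   # uses the whole history
print(short_window, full_window)
```

A larger $h$ smooths over more synchronization cycles at the cost of weighting older averages as heavily as recent ones.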
