A Systematic Study of Model Merging Techniques in Large Language Models
Oğuz Kağan Hitit, Leander Girrbach, Zeynep Akata
TL;DR
The paper investigates whether existing model merging techniques can produce constructive interference in large language models when merging multiple fine-tuned checkpoints. By evaluating six merging methods across four open-weight LLMs and 16 benchmarks, it finds that only Task Arithmetic reliably yields improvements over both the base model and the best individual checkpoint, while interference-aware and subspace-based methods typically degrade performance as the number of merged models grows. The work highlights fundamental limitations of transferring merging techniques from vision-language or smaller-scale models to modern LLMs and motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning. The authors provide a standardized evaluation pipeline and plan to release code to facilitate further research and reproducibility.
Abstract
Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. On these benchmarks, we measure both the probability that a merged model outperforms the base model and its relative gain over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. The other methods, both interference-aware and subspace-based, typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
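To make the setup concrete, the following is a minimal sketch of Task Arithmetic and of the two quantities the abstract reports. It is not the authors' code. It assumes the standard formulation in which each fine-tuned checkpoint contributes a task vector (its weights minus the base weights) and the merged model is the base plus a scaled sum of those vectors. Checkpoints are represented as toy dictionaries of float lists rather than real tensors. The names `task_vector`, `task_arithmetic`, `win_rate`, `relative_gain`, and the `scale` parameter are hypothetical, and the metric definitions are one plausible reading of the abstract rather than the paper's exact formulas.

```python
from typing import Dict, List, Sequence

# Toy stand-in for a checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def task_vector(base: StateDict, finetuned: StateDict) -> StateDict:
    """Per-parameter difference between a fine-tuned checkpoint and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def task_arithmetic(
    base: StateDict, checkpoints: Sequence[StateDict], scale: float = 1.0
) -> StateDict:
    """Merge as base + scale * sum of task vectors (assumed formulation)."""
    merged = {name: list(weights) for name, weights in base.items()}
    for checkpoint in checkpoints:
        for name, delta in task_vector(base, checkpoint).items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged


def win_rate(merged_scores: Sequence[float], base_score: float) -> float:
    """Fraction of merged models scoring strictly above the base model
    (an assumed empirical estimate of the 'probability of outperforming')."""
    return sum(score > base_score for score in merged_scores) / len(merged_scores)


def relative_gain(merged_score: float, checkpoint_scores: Sequence[float]) -> float:
    """Gain of the merged model relative to the best individual checkpoint
    (assumed definition: (merged - best) / best)."""
    best = max(checkpoint_scores)
    return (merged_score - best) / best


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    ckpt_a = {"w": [2.0, 2.0]}  # moved only the first weight
    ckpt_b = {"w": [1.0, 4.0]}  # moved only the second weight
    print(task_arithmetic(base, [ckpt_a, ckpt_b], scale=0.5))
    print(win_rate([0.6, 0.4, 0.7], base_score=0.5))
    print(relative_gain(0.66, [0.5, 0.6]))
```

In this toy case the two checkpoints change disjoint weights, so their task vectors do not interfere. With real LLM checkpoints the task vectors overlap, which is the situation the interference-aware and subspace methods evaluated in the paper are designed to handle.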
