A Systematic Study of Model Merging Techniques in Large Language Models

Oğuz Kağan Hitit, Leander Girrbach, Zeynep Akata

TL;DR

The paper investigates whether existing model merging techniques can produce constructive interference in large language models when merging multiple fine-tuned checkpoints. By evaluating six merging methods across four open-weight LLMs and 16 benchmarks, it finds that only Task Arithmetic reliably yields improvements over both the base model and the best individual checkpoint, while interference-aware and subspace-based methods typically degrade performance as the number of merged models grows. The work highlights fundamental limitations of transferring merging techniques from vision-language or smaller-scale models to modern LLMs and motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning. The authors provide a standardized evaluation pipeline and plan to release code to facilitate further research and reproducibility.

Abstract

Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
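The two quantities named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the abstract does not give exact definitions, so the function names, the strict "greater than" comparison, and the normalization by the best checkpoint's score are all assumptions. The scores are made-up numbers, not results from the paper.

```python
from statistics import mean


def prob_outperforms_base(merged_scores, base_score):
    """Fraction of sampled merged models that score above the base model.

    Assumption: "outperforms" means strictly higher average benchmark score.
    """
    return mean(1.0 if s > base_score else 0.0 for s in merged_scores)


def relative_gain_over_best(merged_score, checkpoint_scores):
    """Relative change of a merged model versus its best constituent checkpoint.

    Assumption: the gain is normalized by the best checkpoint's score.
    """
    best = max(checkpoint_scores)
    return (merged_score - best) / best


# Hypothetical scores, for illustration only.
base_score = 0.60
merged_scores = [0.62, 0.58, 0.65, 0.55]   # one score per sampled merge
constituent_scores = [0.61, 0.63, 0.59]    # checkpoints used in one merge

p_win = prob_outperforms_base(merged_scores, base_score)
gain = relative_gain_over_best(0.65, constituent_scores)
print(f"P(merged > base) = {p_win:.2f}, relative gain = {gain:+.3f}")
```

A positive relative gain is the stricter criterion: it means the merge did better than simply picking the best checkpoint that went into it.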

Paper Structure

This paper contains 17 sections, 11 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Our evaluation protocol pairs each base large language model (LLM) with 12 publicly available checkpoints and repeatedly samples subsets to merge. The sampled checkpoints are merged using three task arithmetic (TA) and three subspace merging methods. Resulting merged models are evaluated on 16 standard LLM benchmarks from lm-eval-harness to analyze trends in which merging methods consistently work well on LLMs.
  • Figure 2: Overview of task-arithmetic-based model merging methods: Task Arithmetic, TIES-Merging, and Model Stock. Given a base model $W_0$ and fine-tuned checkpoints $W_i$, Task Arithmetic computes task vectors $\Delta W_i = W_i - W_0$ and merges them via weighted addition. TIES-Merging extends this by (1) trimming small-magnitude parameter updates, (2) enforcing sign-consistent updates across checkpoints, and (3) merging only aligned parameters to reduce interference. Model Stock instead interpolates between $W_0$ and the geometric center of the fine-tuned checkpoints based on estimated inter-model angles.
  • Figure 3: Average accuracy and standard deviation of the models across all benchmarks. From left to right, the models are Llama 3.2 3B, Qwen3 4B, Llama 3.1 8B, and Qwen3 8B. Shaded areas indicate the standard deviation over different samples of merged checkpoints.
  • Figure 4: Average $L_2$-norm of the task vectors with respect to the base model as a function of the number of merged checkpoints. Each curve reports the mean Euclidean distance $\lVert\theta_{\text{merged}} - \theta_{\text{base}}\rVert_2$ across samples of merged models, with shaded regions indicating the standard deviation. Higher values indicate larger deviations from the base model in parameter space.
  • Figure 5: Overview of subspace-based model merging methods: TSV-Merge, Iso-C, and Subspace Boosting. These methods operate in low-rank task-update subspaces rather than full weight space. TSV-Merge extracts dominant singular directions for each task update, orthogonalizes them via Procrustes alignment, and recombines the aligned subspaces into a unified low-rank update. Iso-C flattens the singular value spectrum of the Task-Arithmetic update, producing an isotropically scaled representation of its principal directions. Subspace Boosting mitigates rank collapse by elevating weaker singular directions above a cumulative-energy threshold, broadening the effective subspace captured by the merged update. In the illustration, we show the TA+SB variant, but any task-vector-based merging method (e.g. TIES) could be substituted by modifying only how the merged task update is computed before applying the Subspace Boosting operation.
  • ...and 8 more figures
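
The captions for Figures 2 and 5 describe several merging rules in words. The sketch below illustrates three of them on a single weight matrix with NumPy. It is a simplified reading of the captions, not the authors' implementation: the function names, the `keep` and `lam` parameters, the single-matrix setting, and the choice to flatten the spectrum to its mean in `iso_c` are assumptions made for illustration.

```python
import numpy as np


def task_vectors(base, checkpoints):
    """Delta W_i = W_i - W_0 for each fine-tuned checkpoint."""
    return [w - base for w in checkpoints]


def task_arithmetic(base, checkpoints, lam=1.0):
    """Add the (uniformly weighted) sum of task vectors back onto the base."""
    return base + lam * sum(task_vectors(base, checkpoints))


def ties_merge(base, checkpoints, keep=0.2, lam=1.0):
    """TIES-style merge: trim, elect a sign per parameter, average agreeing entries."""
    deltas = np.stack(task_vectors(base, checkpoints))
    mags = np.abs(deltas).reshape(len(deltas), -1)
    k = max(1, int(round(keep * mags.shape[1])))
    # (1) Trim: keep only the k largest-magnitude entries of each task vector.
    thresh = np.sort(mags, axis=1)[:, -k]
    thresh = thresh.reshape((-1,) + (1,) * (deltas.ndim - 1))
    trimmed = np.where(np.abs(deltas) >= thresh, deltas, 0.0)
    # (2) Elect: the sign of the summed trimmed updates wins for each parameter.
    elected = np.sign(trimmed.sum(axis=0))
    # (3) Merge: average only the non-zero entries that agree with that sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    count = np.maximum(agree.sum(axis=0), 1)
    return base + lam * (trimmed * agree).sum(axis=0) / count


def iso_c(base, checkpoints, lam=1.0):
    """Iso-C-style merge: flatten the singular values of the summed update.

    Assumption: every singular value is replaced by the mean of the spectrum.
    Only valid for 2-D weight matrices.
    """
    delta = sum(task_vectors(base, checkpoints))
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return base + lam * s.mean() * (u @ vt)


# Toy 2x2 example with hypothetical weights.
base = np.zeros((2, 2))
w1 = np.array([[1.0, 0.1], [0.0, -2.0]])
w2 = np.array([[3.0, -0.1], [0.0, 2.0]])

ta = task_arithmetic(base, [w1, w2])
ties = ties_merge(base, [w1, w2], keep=0.5)
iso = iso_c(base, [w1, w2])
print(ta, ties, iso, sep="\n\n")
```

In the toy example the two checkpoints disagree in sign on the bottom-right entry, so the TIES-style rule drops that entry instead of averaging it. In a real model each rule would be applied per parameter tensor, with the SVD-based rule restricted to matrix-shaped weights.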