OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao

TL;DR

This work introduces a model merging benchmark for MLLMs covering tasks such as VQA, Geometry, Chart, OCR, and Grounding, and proposes a novel method that removes noise from task vectors and robustly optimizes the merged vector using a loss defined over task vector interactions.

Abstract

Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs) that extend LLMs through large-scale multimodal training have gained traction. However, there is no benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. In this paper, $\textbf{(i)}$ we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. $\textbf{(ii)}$ We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. $\textbf{(iii)}$ We find that model merging offers a promising way to build improved MLLMs without requiring training data. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.

Paper Structure

This paper contains 35 sections, 7 theorems, 32 equations, 6 figures, 11 tables.

Key Result

Theorem 3.1

Consider task $i$ trained for $T$ iterations of gradient descent with a fixed step size $\eta \in (0,1/L]$, where $L$ is the Lipschitz constant. Let $\gamma := 1 - \eta\mu \in (0,1)$ denote the PL convergence factor. Then the merged update $\bm{\tau}_m := \sum_{j=1}^m \alpha_j \bm{\tau}_j$ satisfies where $\mathcal{O}(\gamma^T)$ is the residual error from incomplete convergence on task $i$, $\math
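The quantities named in the theorem statement can be made concrete with a small sketch. The code below forms the merged update $\bm{\tau}_m = \sum_j \alpha_j \bm{\tau}_j$ and evaluates the factor $\gamma^T$ with $\gamma = 1 - \eta\mu$. It assumes the usual task-arithmetic convention that a task vector is the fine-tuned parameters minus the base parameters, which the text above does not spell out. All function names and the toy numbers are hypothetical, and the sketch does not implement the paper's denoising or optimization procedure.

```python
import math


def task_vector(finetuned, base):
    """Assumed convention: tau_j = theta_j - theta_0 over flat parameter lists."""
    return [f - b for f, b in zip(finetuned, base)]


def merge_task_vectors(task_vectors, alphas):
    """Weighted sum tau_m = sum_j alpha_j * tau_j, as in the theorem statement."""
    if len(task_vectors) != len(alphas):
        raise ValueError("need exactly one coefficient per task vector")
    dim = len(task_vectors[0])
    merged = [0.0] * dim
    for alpha, tau in zip(alphas, task_vectors):
        for k in range(dim):
            merged[k] += alpha * tau[k]
    return merged


def l2_norm(vec):
    """Euclidean norm of a flat vector (Frobenius norm for a flattened matrix)."""
    return math.sqrt(sum(v * v for v in vec))


def pl_residual_factor(eta, mu, steps):
    """gamma^T with gamma = 1 - eta * mu, the PL convergence factor."""
    gamma = 1.0 - eta * mu
    return gamma ** steps


# Toy example (hypothetical numbers): two experts fine-tuned from one base model.
base = [0.0, 1.0, 2.0]
experts = [[0.5, 1.0, 2.0], [0.0, 1.5, 2.0]]
taus = [task_vector(e, base) for e in experts]
merged = merge_task_vectors(taus, alphas=[1.0, 1.0])
merged_params = [b + m for b, m in zip(base, merged)]
```

With these toy inputs the two task vectors touch disjoint coordinates, so the merged update simply carries both changes. The factor returned by `pl_residual_factor` shrinks geometrically as the number of training steps grows, which is the sense in which the $\mathcal{O}(\gamma^T)$ term reflects incomplete convergence.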

Figures (6)

  • Figure 1: Unifying the capabilities or modalities of MLLMs from open-source communities via model merging, which is a data-free, cost-effective post-hoc method.
  • Figure 2: Visualization of task vectors from the benchmark, revealing the small extent of parameter changes during fine-tuning. InternVL2.5 (full fine-tuning) and Qwen2-VL (low-rank adaptation) exhibit distinct distribution patterns across different tasks.
  • Figure 3: When optimizing \ref{eq:wudi}, $\boldsymbol{\tau}_{m}$ tends to take shortcuts by increasing its magnitude to achieve orthogonality.
  • Figure 4: We plot the progression of the Frobenius norm of the merged vector during optimization (average by layers).
  • Figure 5: Accuracy of CLIP pre-trained ViT-B/32 fine-tuned separately on eight downstream datasets. As training steps increase, performance on each dataset gradually converges.
  • ...and 1 more figure

Theorems & Definitions (13)

  • Theorem 3.1
  • Lemma A.6: Cross-task cosine leakage
  • proof : Proof sketch
  • Lemma A.7: PL convergence under GD
  • proof
  • Lemma A.8: Task vector norm
  • proof
  • Lemma A.9: Inner-product lower bound
  • proof
  • Theorem A.10: Finite-step bound
  • ...and 3 more