Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization

Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci

TL;DR

This work tackles the high cost of Data Mixture Optimization (DMO) for multimodal LLM fine-tuning with a simple, scalable proxy based on linear model merging. One expert is trained per domain, and candidate mixture weights $\mathbf{w}$ are ranked by the target performance $f(\boldsymbol{\theta}^M_{\mathbf{w}})$ of the merged proxy $\boldsymbol{\theta}^M_{\mathbf{w}} = \sum_i w_i \boldsymbol{\theta}_i$, so no per-mixture training is needed. Experiments across 14 benchmarks, two model families, and varying domain counts show strong rank correlation between merged proxies and models actually trained on the corresponding mixtures. The proxy also works across data budgets (50k-budget experts suffice) and outperforms regression-based DMO. A second-order Taylor expansion under local convexity, $\mathcal{L}(\boldsymbol{\theta}, \mathcal{D}_{\mathbf{w}}) \approx \sum_i w_i [\cdot]$, gives theoretical intuition for linear merging, and visualizations show that mixture-trained models are arranged linearly along the expert axis. Overall, linear merging enables cheap, reliable DMO for SFT of multimodal LLMs, reducing the compute needed to search over mixtures.
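
The second-order argument mentioned in the TL;DR can be sketched as follows. This is an illustrative reconstruction, not necessarily the paper's exact derivation. It assumes that the mixture loss is the weighted sum of per-domain losses $\mathcal{L}_i$, that each expert $\boldsymbol{\theta}_i$ is a local minimum of $\mathcal{L}_i$ with positive-definite Hessian $\mathbf{H}_i$, and that $\sum_i w_i = 1$. The symbols $\mathcal{L}_i$, $\mathbf{H}_i$, $\mathbf{H}$ and $\boldsymbol{\theta}^{*}_{\mathbf{w}}$ are introduced here for illustration only.

$$
\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}, \mathcal{D}_{\mathbf{w}})
  &= \sum_i w_i \, \mathcal{L}_i(\boldsymbol{\theta})
  \approx \sum_i w_i \Big[ \mathcal{L}_i(\boldsymbol{\theta}_i)
  + \tfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_i)^{\top} \mathbf{H}_i \, (\boldsymbol{\theta} - \boldsymbol{\theta}_i) \Big], \\
\nabla_{\boldsymbol{\theta}} \mathcal{L} = 0
  &\;\Longrightarrow\;
  \boldsymbol{\theta}^{*}_{\mathbf{w}}
  = \Big( \sum_i w_i \mathbf{H}_i \Big)^{-1} \sum_i w_i \mathbf{H}_i \, \boldsymbol{\theta}_i, \\
\mathbf{H}_i \approx \mathbf{H} \ \text{for all } i
  &\;\Longrightarrow\;
  \boldsymbol{\theta}^{*}_{\mathbf{w}} \approx \sum_i w_i \boldsymbol{\theta}_i = \boldsymbol{\theta}^M_{\mathbf{w}}.
\end{aligned}
$$

The first-order terms vanish because each $\boldsymbol{\theta}_i$ is assumed to be a minimum of its own domain loss. Under these assumptions the minimizer of the approximate mixture loss is a Hessian-weighted average of the experts, which reduces to the plain linear merge when the experts' local curvatures are similar.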

Abstract

Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
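
The procedure described in the abstract (merge the experts with candidate weights, score each merged proxy, rank the candidates) can be sketched in a few lines. This is a minimal toy illustration, not the code from the authors' repository: parameters are plain Python lists rather than real model tensors, and `merge_linear`, `rank_mixtures`, and `toy_score` are hypothetical names. In practice the score would be a benchmark evaluation of the merged model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def merge_linear(experts: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return sum_i w_i * theta_i, computed parameter by parameter."""
    if len(experts) != len(weights):
        raise ValueError("need exactly one weight per expert")
    if abs(sum(weights) - 1.0) > 1e-8:
        raise ValueError("mixture weights should sum to 1")
    merged: Params = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(w * expert[name][j] for w, expert in zip(weights, experts))
            for j in range(size)
        ]
    return merged


def rank_mixtures(
    experts: Sequence[Params],
    candidates: Sequence[Tuple[float, ...]],
    score: Callable[[Params], float],
) -> List[Tuple[float, ...]]:
    """Sort candidate mixture weights by the score of their merged proxy (best first)."""
    scored = [(score(merge_linear(experts, w)), tuple(w)) for w in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [w for _, w in scored]


# Two hypothetical single-domain experts with one 2-dimensional parameter each.
expert_a: Params = {"layer.weight": [1.0, 0.0]}
expert_b: Params = {"layer.weight": [0.0, 1.0]}
candidates = [(1.0, 0.0), (0.5, 0.5), (0.25, 0.75), (0.0, 1.0)]


def toy_score(params: Params) -> float:
    """Made-up target metric: negative squared distance to an arbitrary point."""
    target = [0.3, 0.7]
    return -sum((p - t) ** 2 for p, t in zip(params["layer.weight"], target))


ranking = rank_mixtures([expert_a, expert_b], candidates, toy_score)
print(ranking[0])  # (0.25, 0.75)
```

Only the experts are trained; every additional candidate costs one merge and one evaluation, which is what makes searching over many mixture weights cheap.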

Paper Structure (23 sections, 10 equations, 8 figures, 5 tables, 1 algorithm)

Figures (8)

  • Figure 1: Correlation plots between downstream accuracies of mixture-trained models and our proposed merged proxy. Results are shown for Qwen2-VL-2B and Intern3.5-VL-2B models, fine-tuned on 2-, 3-, and 4-domain data mixtures. Each plot reports Spearman's rank correlation coefficient (R) of the average performance.
  • Figure 2: Cross-data-budget correlation plots. Results are shown for Qwen2-VL-2B and Intern3.5-VL-2B models on 4-domain mixtures.
  • Figure 3: Spearman's rank correlation coefficient (R) of accuracies predicted by a regressor fitted on an increasing number of data points. Each data point comes from a fine-tuned model, while the merged proxy requires only the $K$ expert models.
  • Figure 4: Loss functions in the neighbourhood of expert models, along 5 random directions. In a neighbourhood of their minimum, the loss functions remain convex.
  • Figure 5: Projections onto the plane of the expert models. Projections of models fine-tuned on 2-domain mixtures align with the line connecting the two experts, suggesting they can be approximated by linearly merging the experts' parameters.
  • ...and 3 more figures
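
The figures above report Spearman's rank correlation between proxy and mixture-trained accuracies. As a reminder of what that statistic measures, here is a small self-contained sketch that computes it as the Pearson correlation of average ranks. The accuracy values are made up for illustration and are not results from the paper; `average_ranks` and `spearman` are hypothetical helper names.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up accuracies for five candidate mixtures (illustrative only).
proxy_acc = [0.61, 0.64, 0.58, 0.70, 0.66]    # merged proxy models
trained_acc = [0.63, 0.65, 0.60, 0.71, 0.69]  # models trained on the mixtures

print(spearman(proxy_acc, trained_acc))
```

A value near 1 means the proxy orders the candidate mixtures almost exactly as full training would, even if the absolute accuracies differ, which is the property a mixture search needs.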