Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization
Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci
TL;DR
This work tackles the high cost of Data Mixture Optimization (DMO) for supervised fine-tuning (SFT) of multimodal LLMs by introducing a simple, scalable proxy based on linear model merging. It trains one expert per domain, with parameters $\boldsymbol{\theta}_i$, and forms merged proxies $\boldsymbol{\theta}^M_{\mathbf{w}} = \sum_i w_i \boldsymbol{\theta}_i$. Candidate mixtures $\mathbf{w}$ are then ranked by the target performance $f(\boldsymbol{\theta}^M_{\mathbf{w}})$ of the proxy, which avoids training a separate model for each mixture. Empirical results across 14 benchmarks, two model families, and varying domain counts show strong rank correlation between the merged proxies and models trained on the actual mixtures. The approach is also efficient across budgets (50k-budget experts suffice) and outperforms regression-based DMO. Theoretical intuition comes from a second-order Taylor expansion under local convexity, in which the mixture loss takes the form $\mathcal{L}(\boldsymbol{\theta}, \mathcal{D}_{\mathbf{w}}) \approx \sum_i w_i [\cdot]$, and visualizations confirm the linear arrangement of mixture-trained models along the expert axis. Overall, linear merging enables cheap, reliable DMO for SFT of multimodal LLMs, with practical impact on scalability and resource use.
Abstract
Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
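
The proxy described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that): model parameters are represented here as plain dictionaries of float lists standing in for real checkpoints, and the names `merge_experts`, `simplex_grid`, `rank_mixtures`, and `toy_score` are hypothetical helpers introduced only for this example. The scoring function plays the role of the target performance $f$ and is assumed to be "higher is better".

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def merge_experts(experts: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return the linear merge sum_i w_i * theta_i, parameter by parameter."""
    if len(experts) != len(weights):
        raise ValueError("need exactly one weight per expert")
    merged: Params = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(w * expert[name][k] for w, expert in zip(weights, experts))
            for k in range(size)
        ]
    return merged


def simplex_grid(num_domains: int, steps: int) -> List[Tuple[float, ...]]:
    """Enumerate candidate mixtures on a regular grid of the probability simplex."""
    return [
        tuple(c / steps for c in counts)
        for counts in product(range(steps + 1), repeat=num_domains)
        if sum(counts) == steps
    ]


def rank_mixtures(
    experts: Sequence[Params],
    candidates: Sequence[Tuple[float, ...]],
    score: Callable[[Params], float],
) -> List[Tuple[float, Tuple[float, ...]]]:
    """Score the merged proxy of every candidate and sort from best to worst."""
    scored = [(score(merge_experts(experts, w)), w) for w in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy usage: two "experts" and a score that peaks at a known target point.
expert_a: Params = {"layer": [1.0, 0.0]}
expert_b: Params = {"layer": [0.0, 1.0]}
target = [0.25, 0.75]


def toy_score(params: Params) -> float:
    """Hypothetical benchmark score: negative squared distance to the target."""
    return -sum((p - t) ** 2 for p, t in zip(params["layer"], target))


ranking = rank_mixtures([expert_a, expert_b], simplex_grid(2, 4), toy_score)
best_score, best_weights = ranking[0]
print(best_weights)
```

In this sketch the only training cost is one run per domain expert. Every additional candidate mixture costs one weighted sum of parameters and one evaluation, which is the decoupling of mixture search from training that the abstract describes.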
