Navigating the Accuracy-Size Trade-Off with Flexible Model Merging
Akash Dhasade, Divyansh Jhunjhunwala, Milos Vujasinovic, Gauri Joshi, Anne-Marie Kermarrec
TL;DR
The paper addresses the challenge of combining multiple fine-tuned models without access to training data, and analyzes the accuracy-size trade-off across the full spectrum of deployed sizes. It introduces FlexMerge, a data-free, block-level merging framework that greedily fuses task-specific blocks and accommodates multiple merging algorithms within a unified workflow. Two findings stand out: modest increases in deployed size can yield large accuracy gains, and algorithm rankings vary with size, which motivates evaluation beyond the single-model end state. The framework shows strong empirical performance across vision, NLP, and multi-modal benchmarks, with practical benefits in storage, inference, and generalization, and with efficient merging and reconstruction. Overall, FlexMerge offers a versatile design space for scalable, data-free multi-task fusion that can adapt to deployment constraints and task counts.
Abstract
Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promising, merging into a single model often suffers from an accuracy gap with respect to the fine-tuned models. On the other hand, deploying all individual fine-tuned models incurs high storage costs. We propose FlexMerge, a novel data-free model merging framework that: (a) flexibly generates merged models of varying sizes, spanning the full spectrum from a single merged model to retaining all fine-tuned models; and (b) supports multiple merging algorithms in a unified framework. Using FlexMerge, we systematically characterize the accuracy-size trade-off of different algorithms. Our study reveals two key findings: first, even modestly larger merged models can yield steep accuracy gains (up to 13.5% when just doubling the size); second, algorithm rankings are not consistent as size increases, with some methods overtaking others beyond the one-model regime. These results uncover a new design dimension for model merging: developing and comparing algorithms across the full spectrum of sizes rather than only at the single-model limit. Extensive experiments on vision and NLP benchmarks, with up to 30 tasks, confirm the generality and practicality of FlexMerge.
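To make the idea of greedy block-level merging under a size budget concrete, the following is a minimal sketch and not the authors' implementation. Several choices are assumptions made only for illustration: the budget is counted in stored blocks, cosine similarity decides which pair of blocks to fuse next, plain averaging stands in for the pluggable merging algorithm, and the names `greedy_block_merge`, `cosine`, and `average` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two flat parameter vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average(vectors):
    """Stand-in merging rule (assumption): element-wise mean."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]


def greedy_block_merge(task_blocks, budget):
    """task_blocks[t][b] is the parameter vector of block b for task t.

    Returns, per block position, a list of (task_ids, vector) groups.
    Merging stops once the number of stored blocks is at most `budget`.
    """
    num_tasks = len(task_blocks)
    num_blocks = len(task_blocks[0])
    # Start from the "keep every fine-tuned model" end of the spectrum.
    groups = [
        [((t,), list(task_blocks[t][b])) for t in range(num_tasks)]
        for b in range(num_blocks)
    ]
    stored = num_tasks * num_blocks
    while stored > budget:
        best = None
        # Find the most similar pair of groups at any block position.
        for b, block_groups in enumerate(groups):
            for i, j in combinations(range(len(block_groups)), 2):
                score = cosine(block_groups[i][1], block_groups[j][1])
                if best is None or score > best[0]:
                    best = (score, b, i, j)
        if best is None:
            break  # every block position is already fully merged
        _, b, i, j = best
        tasks = tuple(sorted(groups[b][i][0] + groups[b][j][0]))
        # Re-merge from the original task blocks, data-free.
        merged = average([task_blocks[t][b] for t in tasks])
        groups[b] = [g for k, g in enumerate(groups[b]) if k not in (i, j)]
        groups[b].append((tasks, merged))
        stored -= 1  # each fusion removes exactly one stored block
    return groups
```

Calling `greedy_block_merge(task_blocks, budget)` with a budget equal to the number of block positions recovers the single merged model, while a budget of `num_tasks * num_blocks` leaves all fine-tuned models intact. Intermediate budgets give the in-between sizes that the accuracy-size study compares.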
