Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov, Alexis Audran-Reiss, Michael Slater, Despoina Magka, Tatiana Shavrina, Roberta Raileanu, Yoram Bachrach

TL;DR

This work tackles the resource-intensive training of large language models by introducing Soup Of Category Experts (SoCE), a category-aware model souping framework that leverages benchmark composition to select expert models and non-uniformly weight them. By exploiting weak correlations across benchmark categories, SoCE achieves state-of-the-art performance on the Berkeley Function Calling Leaderboard and shows robust improvements across MGSM and ∞-Bench, while increasing cross-category consistency. The approach is grounded in correlation analysis and Shapley-value analysis to justify candidate selection and weighting, highlighting practical gains from reusing existing checkpoints without retraining. The findings suggest that organized, category-aware model fusion can yield substantial performance and robustness benefits with potential for broad open-source reuse and resource savings in real-world deployments. This work advances efficient model aggregation and provides a principled framework for combining diverse capabilities across LLMs.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging weights from multiple models of the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math, and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
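The two ingredients named in the abstract, picking an expert per weakly correlated category and averaging weights non-uniformly, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `select_experts` and `weighted_soup`, the `corr_threshold` value, the "weakly correlated with at least one other category" criterion, and the toy numbers are all assumptions made for illustration, and the weight optimization step is omitted.

```python
import numpy as np


def select_experts(scores, corr_threshold=0.5):
    """Pick the best-scoring model for each weakly correlated category.

    scores: array of shape (n_models, n_categories).
    A category counts as "weakly correlated" here if its Pearson
    correlation with at least one other category is below
    corr_threshold (an assumed criterion, not taken from the paper).
    Returns {category_index: expert_model_index}.
    """
    scores = np.asarray(scores, dtype=float)
    corr = np.corrcoef(scores, rowvar=False)  # categories x categories
    n_categories = scores.shape[1]
    weak = [
        c for c in range(n_categories)
        if np.min(np.delete(corr[c], c)) < corr_threshold
    ]
    return {c: int(np.argmax(scores[:, c])) for c in weak}


def weighted_soup(state_dicts, weights):
    """Weighted average of parameter dicts with identical keys and shapes."""
    weights = np.asarray(weights, dtype=float)
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per model")
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    keys = set(state_dicts[0])
    if any(set(sd) != keys for sd in state_dicts):
        raise ValueError("all models must share the same parameter names")
    return {
        k: sum(w * np.asarray(sd[k], dtype=float)
               for w, sd in zip(weights, state_dicts))
        for k in keys
    }


# Toy example with made-up scores: 4 models x 3 benchmark categories.
toy_scores = np.array([
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.3],
    [0.5, 0.4, 0.6],
    [0.3, 0.2, 0.9],
])
experts = select_experts(toy_scores)

# Toy "checkpoints": one parameter tensor each.
model_a = {"w": np.array([1.0, 3.0])}
model_b = {"w": np.array([3.0, 5.0])}
soup = weighted_soup([model_a, model_b], weights=[0.25, 0.75])
```

In practice the dictionaries would be full checkpoints of the same architecture, and the weights would be tuned on held-out benchmark data rather than fixed by hand.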

Paper Structure

This paper contains 20 sections, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Pearson Correlation of model performance from BFCL leaderboard
  • Figure 2: Intra-benchmark performance Pearson correlation (pre- and post-souping): The Pearson correlation matrices of metrics across different categories on (left to right) $\sim$800 souped (above) and unsouped (below) checkpoints on BFCL, 40 checkpoints on MGSM, and 80 checkpoints on $\infty$Bench. We observe that after souping, the performance across all categories becomes highly linearly correlated.
  • Figure 3: Shapley value analysis: Figure (a) displays the linear correlation among categories of the MGSM benchmark across 80 checkpoints. Table (b) shows the performance per MGSM benchmark category for a set of 4 fine-tuned Hugging Face candidate models, and Figure (c) shows the Shapley value plots for single candidates, pairs, and triplets. M1 and M2 are the experts for the least-correlated categories (ES-EN and ZH-EN), and they are also the strongest contributor pair. In parallel, M1 is a strong parent and M4 is a weak parent, and the Shapley values reflect that as well, showcasing the strength of SoCE's candidate selection approach.
  • Figure 4: Model performance of SoCE and ingredient models on different sub-categories of BFCL.
  • Figure 5: Analysis of the performance of 37 souped and unsouped checkpoints on FLORES-36: The x-axis is the index of a souping triplet, i.e., the two parents and the souped output, on the FLORES-36 benchmark. The y-axis in the top figure is the number of FLORES-36 categories, and in the bottom figure it is the BLEU score. In the top figure, the orange line counts the categories in which the souped output scores higher than at least one parent, and the blue line counts the categories in which it scores higher than both parents. In the bottom figure, the green line shows the lower of the two parents' average BLEU scores, the purple line shows the higher, and the red line shows the souped candidate's score.
  • ...and 1 more figure
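
The Shapley-value analysis described in the Figure 3 caption attributes a soup's performance to its individual candidate models. As a rough illustration, the sketch below computes exact Shapley values for a small set of candidates. The `value` callback is a placeholder: one natural choice, assumed here and not confirmed by the text above, is the benchmark score of the soup built from a given subset of candidates. The candidate names and numbers are made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small set of players.

    players: list of hashable identifiers (e.g. candidate model names).
    value:   function mapping a frozenset of players to a score;
             value(frozenset()) is the baseline for an empty coalition.
    Enumerates all coalitions, so it is only practical for a handful
    of players.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):  # r = size of the coalition without p
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical additive game: each candidate adds a fixed amount,
# so each Shapley value equals that candidate's own contribution.
toy_contrib = {"A": 3.0, "B": 2.0, "C": 0.5}
toy_phi = shapley_values(
    list(toy_contrib),
    lambda s: sum(toy_contrib[m] for m in s),
)
```

With real checkpoints, every call to `value` would mean building and evaluating a soup, so the number of candidates has to stay small for exact enumeration to be feasible.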