MERGE$^3$: Efficient Evolutionary Merging on Consumer-grade GPUs

Tommaso Mencattini, Adrian Robert Minut, Donato Crisostomi, Andrea Santilli, Emanuele Rodolà

TL;DR

MERGE$^3$ tackles the high computational barrier of evolutionary model merging by combining a reduced fitness evaluation dataset with Item Response Theory–based ability estimation and IRT-driven performance estimators. The method achieves roughly $50$-fold reductions in compute on consumer-grade GPUs while maintaining or improving downstream accuracy, enabling cross-lingual transfer and multilingual model synthesis. The authors provide theoretical guarantees for the estimators and release the Mergenetic library to democratize access to high-quality model merging. Empirical results demonstrate effective cross-lingual skill transfer (e.g., math from English to Japanese) and superior multilingual merging performance on ARC/GSM8K benchmarks, underscoring the practical impact for low-resource and multilingual NLP applications.

Abstract

Evolutionary model merging enables the creation of high-performing multi-task models but remains computationally prohibitive for consumer hardware. We introduce MERGE$^3$, an efficient framework that makes evolutionary merging feasible on a single GPU by reducing fitness computation costs 50$\times$ while preserving performance. MERGE$^3$ achieves this by Extracting a reduced dataset for evaluation, Estimating model abilities using Item Response Theory (IRT), and Evolving optimal merges via IRT-based performance estimators. Our method enables state-of-the-art multilingual and cross-lingual merging, transferring knowledge across languages with significantly lower computational overhead. We provide theoretical guarantees and an open-source library, democratizing high-quality model merging.
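To make the Estimate step more concrete, the following is a minimal sketch of IRT-style performance estimation on a reduced dataset. It is not the paper's mp-IRT or gmp-IRT estimator: it assumes a simplified one-dimensional two-parameter logistic (2PL) model, and the names `estimate_ability` and `estimate_accuracy`, the item parameters `a` (discrimination) and `b` (difficulty), and all numeric values are illustrative assumptions. The idea it shows is fitting a model's ability from its answers on a small subset of items, then predicting its accuracy on the full item pool without evaluating every item.

```python
import numpy as np


def sigmoid(z):
    """Logistic function used by the 2PL item response model."""
    return 1.0 / (1.0 + np.exp(-z))


def estimate_ability(responses, a, b, steps=50):
    """Maximum-likelihood scalar ability under an assumed 2PL model.

    responses: 0/1 correctness per item (on the reduced dataset)
    a, b:      assumed-known item discrimination and difficulty
    """
    theta = 0.0
    for _ in range(steps):
        p = sigmoid(a * (theta - b))
        grad = np.sum(a * (responses - p))        # d log-likelihood / d theta
        hess = -np.sum(a ** 2 * p * (1.0 - p))    # second derivative (negative)
        theta = theta - grad / hess               # Newton step
        theta = float(np.clip(theta, -6.0, 6.0))  # guard against divergence
    return theta


def estimate_accuracy(theta, a, b):
    """Predicted accuracy over all items: mean probability of a correct answer."""
    return float(np.mean(sigmoid(a * (theta - b))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_items = 500
    a = rng.uniform(0.5, 2.0, n_items)   # hypothetical item discriminations
    b = rng.normal(0.0, 1.0, n_items)    # hypothetical item difficulties
    true_theta = 0.5                     # hypothetical "true" ability
    responses = (rng.random(n_items) < sigmoid(a * (true_theta - b))).astype(float)

    subset = rng.choice(n_items, size=25, replace=False)  # reduced dataset
    theta_hat = estimate_ability(responses[subset], a[subset], b[subset])
    print("estimated ability:", theta_hat)
    print("estimated full-dataset accuracy:", estimate_accuracy(theta_hat, a, b))
    print("observed full-dataset accuracy:", responses.mean())
```

In an evolutionary loop, an estimate of this kind would stand in for the full-dataset fitness of each candidate merge, which is where the compute savings would come from.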

Paper Structure

This paper contains 56 sections, 4 theorems, 36 equations, 13 figures, 7 tables, 2 algorithms.

Key Result

Theorem 2

Let $D$ be a dataset, let $\bar{D}\subset D$ be a subset, and let $F(\cdot;\bar{D})$ be $\epsilon$-stable with respect to $F(\cdot;D)$, with a fixed $\epsilon>0$. Define Then

Figures (13)

  • Figure 1: Accuracy on Japanese GSM8K over fitness evaluation FLOPs. MERGE$^3$ is competitive with a model evolved on the full dataset by only using a consumer-grade GPU and $2\%$ of the data (point size reflects data amount).
  • Figure 2: MERGE$^3$ for math + Japanese merging (GSM8K). The method Extracts a reduced evolutionary dataset, Estimates the models' ability parameters ($\gamma$) via Item Response Theory (IRT) from the correctness of their responses, and Evolves the endpoint models through iterative merging. Leveraging an IRT-based performance estimator, it approximates full-dataset fitness with reduced data, cutting fitness estimation costs while preserving full-dataset accuracy, making evolutionary merging feasible on consumer GPUs.
  • Figure 3: Performance Estimators: Absolute error of various estimators as a function of sample size (lower is better). Our mp-IRT and gmp-IRT estimators consistently achieve lower error across various sample sizes and datasets. Additional results available in \ref{fig:estimation_comparison_winogrande_hellaswag}.
  • Figure 4: Ability Estimator: Cosine similarity between estimated and true abilities for different tasks (higher is better). Our estimated abilities $\gamma^{\{\mathrm{mp},\mathrm{gmp}\}-{\mathrm{IRT}}}$ better approximate true abilities.
  • Figure 5: Cross-lingual skill transfer: merging math models (dark blue) with language-specific models (red) effectively transfers mathematical skills across languages (green - our method) compared to baselines (white). Accuracy on GSM8K for each target language.
  • ...and 8 more figures
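The two evaluation metrics named in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative sketch only: the function names and the example values are assumptions, not taken from the paper or from the Mergenetic library.

```python
import numpy as np


def absolute_error(estimated_accuracy, true_accuracy):
    """Figure 3 metric: distance between estimated and full-dataset accuracy."""
    return abs(estimated_accuracy - true_accuracy)


def cosine_similarity(estimated_ability, true_ability):
    """Figure 4 metric: alignment between estimated and true ability vectors."""
    u = np.asarray(estimated_ability, dtype=float)
    v = np.asarray(true_ability, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / denom)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(absolute_error(0.52, 0.50))
    print(cosine_similarity([0.9, 0.1, 0.4], [1.0, 0.0, 0.5]))
```

Lower absolute error and higher cosine similarity both indicate that the reduced-data estimate tracks the full-data quantity more closely.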

Theorems & Definitions (10)

  • Definition 1: $\epsilon$-Stability
  • Theorem 2: $\epsilon$-Optimality Preservation
  • Definition 3: $\epsilon$-Stability in expectation
  • Theorem 4: Expected $\epsilon$-Stability of the Minimum
  • Proposition 5: Asymptotic unbiasedness of mp-IRT
  • Theorem 6: Asymptotic performance preservation of mp-IRT
  • proof
  • proof
  • proof
  • proof
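
The intuition behind optimality preservation under $\epsilon$-stability can be checked numerically. The definitions are not reproduced in this summary, so the sketch below rests on an assumed, common reading: $F(\cdot;\bar{D})$ is $\epsilon$-stable with respect to $F(\cdot;D)$ if $|F(x;\bar{D}) - F(x;D)| \le \epsilon$ for every candidate $x$, and fitness is minimized. Under that assumption, the candidate that is best on the reduced dataset is at most $2\epsilon$ worse than the true optimum on the full dataset. The function name `optimality_gap` and the random data are hypothetical, and the paper's exact statement and constants may differ.

```python
import numpy as np


def optimality_gap(full_fitness, reduced_fitness):
    """Return (eps, gap) for a finite set of candidates.

    eps: largest absolute difference between reduced and full fitness
    gap: full-dataset fitness lost by selecting the reduced-data minimizer
    """
    full = np.asarray(full_fitness, dtype=float)
    reduced = np.asarray(reduced_fitness, dtype=float)
    eps = float(np.max(np.abs(full - reduced)))
    chosen = int(np.argmin(reduced))          # best candidate on reduced data
    gap = float(full[chosen] - full.min())    # how suboptimal it is on full data
    return eps, gap


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full = rng.random(100)                           # hypothetical full-data fitness
    reduced = full + rng.uniform(-0.05, 0.05, 100)   # perturbed reduced-data fitness
    eps, gap = optimality_gap(full, reduced)
    print("eps:", eps, "gap:", gap, "bound 2*eps:", 2 * eps)
```

The bound follows from applying the assumed stability inequality twice: once at the reduced-data minimizer and once at the full-data minimizer.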