Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning

Aakanksha, Arash Ahmadian, Seraphina Goldfarb-Tarrant, Beyza Ermis, Marzieh Fadaee, Sara Hooker

TL;DR

This work finds that objective-based merging is more effective than mixing data, with improvements of up to 8% and 10% in general performance and safety respectively, and finds that language-based merging is highly effective.

Abstract

Large Language Models (LLMs) have been adopted and deployed worldwide for a broad variety of applications. However, ensuring their safe use remains a significant challenge. Preference training and safety measures often overfit to harms prevalent in Western-centric datasets, and safety protocols frequently fail to extend to multilingual settings. In this work, we explore model merging in a diverse multi-task setting, combining safety and general-purpose tasks within a multilingual context. Each language introduces unique and varied learning challenges across tasks. We find that objective-based merging is more effective than mixing data, with improvements of up to 8% and 10% in general performance and safety respectively. We also find that language-based merging is highly effective -- by merging monolingually fine-tuned models, we achieve a 4% increase in general performance and 7% reduction in harm across all languages on top of the data mixtures method using the same available data. Overall, our comprehensive study of merging approaches provides a useful framework for building strong and safe multilingual models.
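As a rough illustration of what merging two fine-tuned checkpoints can look like, the sketch below applies spherical linear interpolation (SLERP), one of the merging methods named in the figure captions, to each parameter tensor of two checkpoints. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `slerp` and `merge_checkpoints`, the representation of a checkpoint as a dict of flat float lists, the per-tensor application, and the interpolation weight `t` are all hypothetical choices made for illustration.

```python
import math


def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    t = 0 returns w_a, t = 1 returns w_b. Falls back to plain linear
    interpolation when the vectors are (nearly) parallel or one is zero.
    """
    linear = [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(y * y for y in w_b))
    if norm_a < eps or norm_b < eps:
        return linear

    # Angle between the two weight vectors.
    cos_omega = sum(x * y for x, y in zip(w_a, w_b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return linear

    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * x + coef_b * y for x, y in zip(w_a, w_b)]


def merge_checkpoints(ckpt_a, ckpt_b, t=0.5):
    """Merge two checkpoints tensor by tensor (assumed: same keys and shapes)."""
    return {name: slerp(ckpt_a[name], ckpt_b[name], t) for name in ckpt_a}


# Hypothetical toy checkpoints standing in for a safety model and a general model.
safety_ckpt = {"layer.weight": [1.0, 0.0]}
general_ckpt = {"layer.weight": [0.0, 1.0]}
merged = merge_checkpoints(safety_ckpt, general_ckpt, t=0.5)
```

By contrast, the data-mixing baseline trains a single model on the combined safety and general data, so no post-hoc interpolation of weights is involved.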

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of our Mix versus Merge framework: We analyze how merging models trained on specialized multilingual datasets, particularly in the context of safety, differs from training models directly on mixtures of these datasets. We follow the LLM-as-a-judge approach for evaluating the performance of these models along two axes -- general and safety.
  • Figure 2: Mixing versus merging: Safety and general performance of a 15% Safety Mix model (described in the training data mixtures subsection) against SLERP merging, which emerges as the best method for balancing trade-offs, for both SFT- and DPO-based checkpoints. Lower is better for (a) and higher is better for (b). Both metrics are measured with respect to the Aya 23 base model.
  • Figure 3: Comparison between different merging methods across safety and general performance with DPO checkpoints. Both metrics are measured with respect to the Aya 23 base model. Lower is better for the left and higher is better for the right. The red dashed line represents the model trained on a mixture of safety and general data (15% Safety Mix).
  • Figure 4: Comparison between different merging methods across safety and general performance with SFT checkpoints. Both metrics are measured with respect to the Aya 23 base model. Lower is better for the left and higher is better for the right. The red dashed line represents the model trained on a mixture of safety and general data (15% Safety Mix).
  • Figure 5: Monolingual model merging: We compare mixing vs merging with SFT checkpoints optimized for languages. The "[All]" bars represent model variants with all 6 languages -- English, Hindi, French, Spanish, Arabic and Russian. "[EN,FR,SP]" represents the pool of English, French and Spanish "monolingual" models. Both metrics are measured with respect to the Aya 23 base model. Lower is better for the left and higher is better for the right.
  • ...and 1 more figures