Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts

Tong Zhu; Daize Dong; Xiaoye Qu; Jiacheng Ruan; Wenliang Chen; Yu Cheng

TL;DR

Mixture-of-Experts instruction tuning often suffers from redundant information when diverse datasets are simply concatenated with fixed sampling. The authors propose a dynamic data sampling approach that builds dataset-level representations from MoE gate loads and updates per-dataset sampling weights based on inter-dataset redundancies observed during training, aiming to maximize global performance under a fixed budget. Across two MoE models and multiple instruction datasets, the method consistently improves knowledge & reasoning and open-ended instruction-following tasks without the extra cost of reference-loss estimation, and is analyzed through data combinations, expert specialization, and efficiency studies. The work provides practical insights into dataset curation for MoE-based instruction tuning and contributes an automatic, state-aware sampling mechanism with publicly available code.

Abstract

Mixture-of-Experts (MoE) models have shown remarkable capability in instruction tuning, especially when the number of tasks scales. However, previous methods simply merge all training tasks (e.g., creative writing, coding, and mathematics) and apply fixed sampling weights, without considering how the importance of different tasks shifts as the model's training state changes. As a result, the most helpful data cannot be effectively distinguished, leading to suboptimal model performance. To reduce potential redundancies among datasets, we make the first attempt at, and propose, a novel dynamic data mixture for MoE instruction tuning. Specifically, inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets. Finally, we propose to dynamically adjust the sampling weights of datasets according to their inter-redundancies, thus maximizing global performance under a limited training budget. The experimental results on two MoE models demonstrate the effectiveness of our approach on both downstream knowledge & reasoning tasks and open-ended queries. Code and models are available at https://github.com/Spico197/MoE-SFT
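The idea in the abstract (dataset-level representations from gate loads, then sampling weights adjusted by inter-dataset redundancy) can be illustrated with a minimal sketch. This is not the authors' implementation: the L2 distance, the multiplicative update, and the names `update_sampling_weights`, `eta`, and `smoothing` are assumptions chosen for illustration; the released repository is the reference for the actual rule.

```python
import math


def normalize(load):
    """Turn raw per-expert token counts into a distribution over experts."""
    total = sum(load)
    return [x / total for x in load]


def l2_distance(a, b):
    """Euclidean distance between two gate-load vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def update_sampling_weights(weights, gate_loads, eta=10.0, smoothing=0.05):
    """Hypothetical update: up-weight datasets whose gate load differs most
    from the others, i.e. datasets that look least redundant.

    weights    -- current sampling weights, one per dataset (sum to 1)
    gate_loads -- per-dataset lists of token counts routed to each expert
    eta        -- assumed step size for the multiplicative update
    smoothing  -- assumed mix-in of the uniform distribution
    Requires at least two datasets.
    """
    n = len(weights)
    loads = [normalize(g) for g in gate_loads]
    # Mean distance from each dataset to all other datasets.
    delta = [
        sum(l2_distance(loads[i], loads[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # Multiplicative update followed by renormalization.
    scaled = [w * math.exp(eta * d) for w, d in zip(weights, delta)]
    total = sum(scaled)
    updated = [s / total for s in scaled]
    # Blend with uniform so that no dataset's weight collapses to zero.
    return [(1 - smoothing) * u + smoothing / n for u in updated]
```

For example, with three datasets where two route tokens identically and the third concentrates on one expert, calling `update_sampling_weights([1/3, 1/3, 1/3], [[10, 10, 10, 10], [10, 10, 10, 10], [40, 0, 0, 0]])` gives the third dataset the largest weight while the two redundant datasets keep equal weights.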

Paper Structure (31 sections, 3 equations, 6 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Mixture-of-Experts.
Instruction Tuning.
Dynamic Data Mixing in Pre-training.
Preliminaries of Mixture-of-Experts
Methodology
Dataset Differences via Gate Load
Dynamic Data Sampling
Experiments
Instruction Tuning Datasets
Evaluation Datasets
Baselines
Implementation Details
Main Results
...and 16 more sections

Figures (6)

Figure 1: Our proposed dynamic data sampling method for instruction tuning. As the training progresses, the model can dynamically adjust the proportion of data sampling. For comparison, previous works concatenate datasets directly and apply fixed sampling weights.
Figure 2: Results on different data combinations. LLaMA-MoE 3.5B-2E is fine-tuned for this experiment. S, O, M, and C denote ShareGPT, OpenOrca, Math Instruct, and Code Instructions, respectively.
Figure 3: Gate load differences of LLaMA-MoE 3.5B-2E under different training settings. If the experts are less specialized after training, the distances and the $\text{CV}(\mathcal{O}_i)^2$ would go down. For Dynamic and Dynamic w/o balance loss, "Beginning" refers to the first round of evaluation, for ease of recording.
Figure 4: Dynamic sampling weights with different evaluation intervals. Experiments are conducted on LLaMA-MoE 3.5B-2E.
Figure 5: Performance at different training steps. Experiments are conducted on LLaMA-MoE 3.5B-2E.
...and 1 more figure
