MergeIT: From Selection to Merging for Efficient Instruction Tuning

Hongyi Cai, Yuqian Fu, Hongming Fu, Bo Zhao

TL;DR

MergeIT shifts instruction tuning from data selection to data synthesis. It first applies topic-aware filtering, using K-means clustering and facility-location-based subset selection to preserve diversity, and then uses LLMs to merge semantically related instructions into richer, more compact samples. This two-stage approach shrinks the dataset (to about 20% of the data after filtering, then to roughly 6k samples after merging) while increasing informational density and task coverage, addressing the computational cost and diversity problems of LLM-scored selection. Empirical results show state-of-the-art or competitive performance across multiple benchmarks (MT-Bench, HellaSwag, MMLU, GSM8K, ARC, TruthfulQA) and strong gains over baselines, supporting LLM-based merging as a viable alternative to traditional scoring-based selection for instruction tuning. The work also demonstrates practical benefits, such as improved explanation depth and knowledge integration in merged outputs, and discusses scalability when smaller open-source models perform the merging.
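
The filtering stage described above (cluster by topic, then keep a diverse subset within each topic) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity measure, the farthest-first center initialization, and the `keep_ratio` parameter are all assumptions made for illustration, and the input is assumed to be one precomputed embedding vector per instruction.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means (requires k <= len(points)); returns one label per point.

    Assumption: centers are seeded deterministically by farthest-first
    traversal so the sketch needs no random state.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def facility_location(points, budget):
    """Greedily maximize F(S) = sum_i max_{j in S} sim(i, j) over subsets S."""
    n = len(points)
    sim = [[cosine(points[i], points[j]) for j in range(n)] for i in range(n)]
    best = [0.0] * n  # how well each point is covered by the current selection
    selected = []
    for _ in range(min(budget, n)):
        # Marginal gain of adding candidate j to the selection.
        gains = {
            j: sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            for j in range(n) if j not in selected
        }
        pick = max(gains, key=gains.get)
        selected.append(pick)
        best = [max(best[i], sim[i][pick]) for i in range(n)]
    return selected


def topic_aware_filter(embeddings, k, keep_ratio=0.2):
    """Cluster into k topics, then keep about keep_ratio of each topic."""
    labels = kmeans(embeddings, k)
    kept = []
    for c in range(k):
        idx = [i for i, l in enumerate(labels) if l == c]
        if not idx:
            continue
        budget = max(1, round(keep_ratio * len(idx)))
        local = facility_location([embeddings[i] for i in idx], budget)
        kept.extend(idx[j] for j in local)
    return sorted(kept)
```

Calling `topic_aware_filter(embeddings, k=2, keep_ratio=0.2)` returns the indices of the retained instructions. The default of 0.2 only mirrors the "about 20%" figure in the summary; the embedding model, the number of clusters, and the similarity measure used in the paper are not stated here, so they are left as inputs.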

Abstract

Instruction tuning is crucial for optimizing Large Language Models (LLMs), yet mainstream data selection methods heavily rely on LLMs as instruction quality scorers, leading to high computational costs and reduced data diversity. To address these limitations, we propose MergeIT, a novel LLM-based Merging strategy for better Instruction Tuning that shifts the focus from selection to synthesis. MergeIT operates in two stages: first, topic-aware filtering clusters and refines the dataset, preserving diversity while eliminating redundancy without relying on LLM-based scoring. Second, LLM-based merging synthesizes semantically similar instructions into more informative and compact training data, enhancing data richness while further reducing dataset size. Experimental results demonstrate that MergeIT enables efficient, diverse, and scalable instruction selection and synthesis, establishing LLM-based merging as a promising alternative to conventional scoring-based selection methods for instruction tuning. Our source code and datasets are now available at https://github.com/XcloudFance/MergeIT

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison between our method and prior works. Unlike prior works, which primarily use LLMs as scorers, we explore their role as mergers, enhancing diversity and time efficiency.
  • Figure 2: t-SNE visualization of K-means on Alpaca_52k. The instructions naturally form distinct clusters, indicating an inherent topical structure effectively captured by clustering.
  • Figure 3: Overview of MergeIT: 1) Topic-aware filtering clusters instructions into topics and filters redundant samples within each topic. 2) LLM-based merging synthesizes new instructions by combining similar pairs.
  • Figure 4: Comparison across different numbers of training samples used for instruction tuning.
  • Figure 5: AlpacaEval results. Compared models are MergeIT-6k vs. Alpaca-52k full samples (line 1), MergeIT-9k vs. Alpaca-52k full samples (line 2), and Superfiltering-6k vs. Alpaca-52k full samples (line 3).