Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction

Tingchen Fu, Deng Cai, Lemao Liu, Shuming Shi, Rui Yan

TL;DR

This work investigates the alignment tax that arises during supervised instruction tuning of LLMs, proposing that dataset biases, rather than data quality or forgetting of pretraining knowledge, drive the degradation on standard benchmarks. The authors introduce disperse-then-merge (DTM), a three-step framework that clusters instruction-following data, trains a sub-model on each cluster, and merges the sub-models in weight space to mitigate biases without extra inference cost. Across nine benchmarks spanning math reasoning, world knowledge, and code generation, DTM consistently outperforms data-filtering and regularization baselines, and simple average merging proves robust. An analysis of sub-model error sets shows both shared and unique biases; merging recovers common knowledge while attenuating individual biases, offering a practical, scalable approach to aligning LLMs without sacrificing core capabilities.

Abstract

Supervised fine-tuning (SFT) on an instruction-following corpus is a crucial approach to aligning large language models (LLMs). However, the performance of LLMs on standard knowledge and reasoning benchmarks tends to deteriorate in the later stages of SFT, echoing the phenomenon of alignment tax. Based on a pilot study, we hypothesize that data biases are probably one cause of this phenomenon. To address the issue, we introduce a simple disperse-then-merge framework. Concretely, we disperse the instruction-following data into portions and train multiple sub-models, each on a different portion. We then merge the sub-models into a single model via model merging techniques. Despite its simplicity, our framework outperforms various sophisticated methods such as data curation and training regularization on a series of standard knowledge and reasoning benchmarks.
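
The disperse and merge steps can be illustrated with a minimal sketch. It assumes the simplest setting: data is split round-robin (the TL;DR describes clustering instead), each sub-model's weights are represented as a plain dictionary of float lists, and merging is a uniform parameter-wise average. The names `disperse`, `average_merge`, and `StateDict` are hypothetical and do not come from the paper; training the sub-models is omitted entirely.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for real model weights: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def disperse(examples: Sequence[str], num_portions: int) -> List[List[str]]:
    """Split examples into disjoint portions.

    Assumption: round-robin assignment is used here only for illustration;
    the summary above describes clustering the data instead.
    """
    if num_portions < 1:
        raise ValueError("num_portions must be at least 1")
    portions: List[List[str]] = [[] for _ in range(num_portions)]
    for index, example in enumerate(examples):
        portions[index % num_portions].append(example)
    return portions


def average_merge(sub_models: Sequence[StateDict]) -> StateDict:
    """Merge sub-models in weight space by uniform parameter-wise averaging."""
    if not sub_models:
        raise ValueError("need at least one sub-model to merge")
    reference = sub_models[0]
    for state in sub_models[1:]:
        if state.keys() != reference.keys():
            raise ValueError("sub-models must share the same parameter names")
    merged: StateDict = {}
    for name, ref_values in reference.items():
        tensors = [state[name] for state in sub_models]
        if any(len(t) != len(ref_values) for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Average each coordinate across all sub-models.
        merged[name] = [sum(column) / len(sub_models) for column in zip(*tensors)]
    return merged


if __name__ == "__main__":
    portions = disperse(["a", "b", "c", "d", "e"], num_portions=2)
    print(portions)
    # Pretend these weights came from training one sub-model per portion.
    sub_a = {"layer.weight": [1.0, 3.0]}
    sub_b = {"layer.weight": [3.0, 5.0]}
    print(average_merge([sub_a, sub_b]))
```

Because the merged result is a single set of weights with the same shape as any one sub-model, inference cost matches that of a single model, which is consistent with the "no extra inference cost" claim in the TL;DR.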

Paper Structure

This paper contains 29 sections, 1 equation, 11 figures, and 10 tables.

Figures (11)

  • Figure 1: The performance on MMLU and BBH when tuning Llama-2-7b and Llama-2-13b with different sizes of instruction-following data from Tülu-V2-mix.
  • Figure 2: The performance on MMLU (5-shot, accuracy) and BBH (3-shot, exact match) when tuning Llama-2-7b with different sizes of instruction-following data from the high-quality subset of Tülu-V2-mix.
  • Figure 3: The performance on MMLU (5-shot, accuracy) and BBH (3-shot, exact match) when tuning Llama-2-7b with different sizes of instruction-following data from Tülu-V2-mix with Replay.
  • Figure 4: The loss variation ratio between the training set and the validation set, or $\Delta\mathcal{L}_{train}/\Delta\mathcal{L}_{val}$ when tuning Llama-2-7b-hf on Tülu-V2-mix data.
  • Figure 5: The correlation between training loss reduction and validation loss reduction at token level.
  • ...and 6 more figures