Demystifying Instruction Mixing for Fine-tuning Large Language Models

Renxi Wang; Haonan Li; Minghao Wu; Yuxia Wang; Xudong Han; Chiyu Zhang; Timothy Baldwin

TL;DR

This paper investigates how to mix instruction datasets for fine-tuning large language models, focusing on three instruction types: NLP tasks, coding, and general chat. It systematically analyzes eight mixing configurations, along with varying mixing ratios and instance counts, using LLaMA-2 7B and 13B, and evaluates the resulting models on NLP benchmarks, code generation, and alignment. The results show that specialized datasets boost domain-specific performance, but naive mixtures can degrade other capabilities; coding data improves both coding and alignment, while NLP-task data can harm alignment when combined with other types. The findings highlight model size as a key factor in leveraging instruction diversity and provide a foundation for designing future instruction mixtures.
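
One way to read the "eight mixing configurations" is as every subset of the three instruction types, including the empty set (the base model with no instruction data). That reading is an assumption, as is the mapping of instruction types to P3, CodeAlpaca, and Alpaca (dataset names taken from the figure captions); the function name `enumerate_mixtures` is hypothetical. A minimal Python sketch under those assumptions:

```python
from itertools import combinations

# Assumed mapping from instruction type to dataset.
# Dataset names come from the figure captions; the pairing is an assumption.
INSTRUCTION_SOURCES = {
    "nlp_tasks": "P3",
    "coding": "CodeAlpaca",
    "general_chat": "Alpaca",
}


def enumerate_mixtures(sources):
    """Return every subset of the instruction types, from empty to full."""
    names = list(sources)
    mixtures = []
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            mixtures.append(combo)
    return mixtures


mixtures = enumerate_mixtures(INSTRUCTION_SOURCES)
for mix in mixtures:
    # The empty subset stands for the model without any instruction data.
    label = " + ".join(INSTRUCTION_SOURCES[name] for name in mix)
    print(label or "(no instruction data)")
```

Three instruction types give 2^3 = 8 subsets, which matches the count stated above; whether the paper's eight configurations are exactly these subsets is not confirmed by this summary.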

Abstract

Instruction tuning significantly enhances the performance of large language models (LLMs) across various tasks. However, how best to mix instruction datasets for LLM fine-tuning is still poorly understood. This study categorizes instructions into three primary types: NLP downstream tasks, coding, and general chat. We explore how instruction tuning on different combinations of datasets affects LLM performance, and find that certain instruction types are more advantageous for specific applications but can negatively impact other areas. This work provides insights into instruction mixtures, laying the foundations for future research.

Paper Structure (23 sections, 3 figures, 3 tables)

Introduction
Related Work
Experimental Setup
Datasets
Evaluation
Models
Results
NLP Tasks and Code Benchmark Results
Mixing with Different Ratios
Number of instances
Alignment Skill Results
Conclusion
Examples of Instruction Types
Alignment Skills Demonstration
Logical Correctness
...and 8 more sections

Figures (3)

Figure 1: Instruction type distribution of P3 and Alpaca. For P3, the statistics come from the original dataset, while for Alpaca, we use a dependency parsing approach to extract the root verb of each instruction.
Figure 2: NLP benchmark scores (avg) and Code benchmark (HumanEval) scores for LLaMA-2-7B tuned with different mixing ratios and different numbers of instances. We keep the number of Alpaca instances constant at 20K and change the number of P3 and CodeAlpaca instances to get different ratios.
Figure 3: Alignment skill assessment prompt (from FLASK; Ye et al., 2023). The parts shown in blue are filled in with the corresponding content.
