SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, Sanqiang Zhao

TL;DR

SFTMix tackles the high cost of acquiring high-quality instruction-tuning data by introducing a confidence-aware Mixup strategy. It partitions the data into confident and unconfident subsets based on training dynamics and interpolates between them, which regularizes learning, mitigates overfitting, and improves generalization. Across instruction-following and domain-specific tasks, SFTMix yields consistent improvements over standard next-token prediction (NTP) baselines and remains compatible with data selection and parameter-efficient tuning approaches. The gains hold across multiple LLM families, suggesting broader applications in using instruction-tuning data more efficiently.

Abstract

To acquire instruction-following capabilities, large language models (LLMs) undergo instruction tuning, where they are trained on instruction-response pairs using next-token prediction (NTP). Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning (SFT) datasets, typically requiring data filtering with proprietary LLMs or human annotation. In this paper, we take a different approach by proposing SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning without relying on well-curated datasets. We observe that LLMs exhibit uneven confidence across the semantic representation space. We argue that examples with different confidence levels should play distinct roles in instruction tuning: Confident data is prone to overfitting, while unconfident data is harder to generalize. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels. We then interpolate them to bridge the confidence gap and apply a Mixup-based regularization to support learning on these additional, interpolated examples. We demonstrate the effectiveness of SFTMix in both instruction-following and healthcare-specific SFT tasks, with consistent improvements across LLM families and SFT datasets of varying sizes and qualities. Extensive analyses across six directions highlight SFTMix's compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.
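The recipe described in the abstract (score confidence from training dynamics, split the data, interpolate across the two subsets, and regularize on the interpolated examples) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: confidence is taken to be each example's gold-token probability averaged over checkpoints, the split is into equal halves, and the interpolation is the standard Mixup convex combination applied to representations and one-hot targets with a Beta-distributed coefficient. The function names are hypothetical.

```python
import numpy as np


def confidence_from_dynamics(probs_per_checkpoint):
    """Assumed confidence score: mean gold-token probability per example,
    averaged over checkpoints. Input shape: (num_checkpoints, num_examples)."""
    return np.asarray(probs_per_checkpoint, dtype=float).mean(axis=0)


def split_by_confidence(confidence):
    """Assumed split: the top half is 'confident', the bottom half 'unconfident'.
    Returns two index arrays of equal length."""
    order = np.argsort(confidence)  # ascending confidence
    half = len(order) // 2
    unconfident = order[:half]
    confident = order[len(order) - half:]
    return confident, unconfident


def mixup(h_conf, h_unconf, y_conf, y_unconf, lam):
    """Standard Mixup: the same convex combination for inputs and targets."""
    h_mix = lam * h_conf + (1.0 - lam) * h_unconf
    y_mix = lam * y_conf + (1.0 - lam) * y_unconf
    return h_mix, y_mix


def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against soft (interpolated) targets, averaged over rows."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(soft_targets * log_probs).sum(axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy training dynamics: 3 checkpoints, 8 examples.
    conf = confidence_from_dynamics(rng.uniform(size=(3, 8)))
    confident, unconfident = split_by_confidence(conf)

    # Toy representations (dimension 4) and one-hot targets over 5 tokens.
    hidden = rng.normal(size=(8, 4))
    targets = np.eye(5)[rng.integers(0, 5, size=8)]

    lam = rng.beta(1.0, 1.0)  # mixing coefficient; Beta(1, 1) is an assumed choice
    h_mix, y_mix = mixup(hidden[confident], hidden[unconfident],
                         targets[confident], targets[unconfident], lam)

    # A random linear map stands in for the language-model output head.
    logits = h_mix @ rng.normal(size=(4, 5))
    print("Mixup regularization term:", soft_cross_entropy(logits, y_mix))
```

In an actual training loop, a term like this would be added to the usual NTP loss rather than replacing it. How the two are weighted, and at which layer the interpolation happens, are design choices this sketch leaves open.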

Paper Structure (37 sections, 7 equations, 4 figures, 12 tables)

Figures (4)

  • Figure 1: Embeddings of $2{,}500$ most and $2{,}500$ least confident examples in Alpaca-52K by Llama-3.1-8B trained using NTP. The clear separation between these embeddings suggests that the LLM exhibits varying confidence levels across different semantic regions.
  • Figure 2: The overall pipeline of the three-stage SFTMix recipe for LLM instruction tuning.
  • Figure 3: Confidence distributions from instruction-tuning Llama on datasets of varying qualities. On the y-axis, "High" represents higher-quality examples from GPT-4, while "Low" denotes lower-quality original examples from Alpaca-52K. Llama's confidence distributions show substantial overlap across these datasets.
  • Figure 4: A qualitative example from the extraction category in MT-Bench. Compared to its NTP-tuned counterpart, Llama instruction-tuned by SFTMix accurately interprets the queries from both turns and correctly extracts the relevant information from the prompt.