SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Yuxin Xiao; Shujian Zhang; Wenxuan Zhou; Marzyeh Ghassemi; Sanqiang Zhao

TL;DR

SFTMix tackles the high cost of acquiring high-quality instruction-tuning data by introducing a confidence-aware Mixup strategy. It partitions the training data into confident and unconfident regions based on training dynamics and interpolates between them, which regularizes learning, mitigates overfitting, and improves generalization. Across instruction-following and domain-specific tasks, SFTMix yields consistent improvements over standard next-token prediction baselines and remains compatible with data selection and parameter-efficient tuning approaches. The gains hold across multiple LLM families, pointing to broader applications in using instruction-tuning data more efficiently.

Abstract

To acquire instruction-following capabilities, large language models (LLMs) undergo instruction tuning, where they are trained on instruction-response pairs using next-token prediction (NTP). Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning (SFT) datasets, typically requiring data filtering with proprietary LLMs or human annotation. In this paper, we take a different approach by proposing SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning without relying on well-curated datasets. We observe that LLMs exhibit uneven confidence across the semantic representation space. We argue that examples with different confidence levels should play distinct roles in instruction tuning: Confident data is prone to overfitting, while unconfident data is harder to generalize. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels. We then interpolate them to bridge the confidence gap and apply a Mixup-based regularization to support learning on these additional, interpolated examples. We demonstrate the effectiveness of SFTMix in both instruction-following and healthcare-specific SFT tasks, with consistent improvements across LLM families and SFT datasets of varying sizes and qualities. Extensive analyses across six directions highlight SFTMix's compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.
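The three ingredients named in the abstract (confidence from training dynamics, interpolation between confident and unconfident examples, and a Mixup-style loss on the interpolated examples) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes, purely for illustration, that confidence is the mean probability assigned to the reference response averaged over checkpoints, that the split is at the median, and that interpolation acts on fixed-size representation vectors with soft targets. All function names are hypothetical.

```python
import numpy as np

def confidence_from_dynamics(gold_probs):
    """Average a per-example score over training checkpoints.

    gold_probs: array of shape (num_checkpoints, num_examples); each entry is
    the mean probability the model gave to the reference response at that
    checkpoint (an assumed confidence proxy, not taken from the paper).
    """
    return np.asarray(gold_probs, dtype=float).mean(axis=0)

def split_by_confidence(conf):
    """Return (confident_idx, unconfident_idx) using a median split (assumption)."""
    order = np.argsort(conf)            # ascending confidence
    half = len(order) // 2
    return order[half:], order[:half]   # upper half, lower half

def mixup_pair(h_conf, h_unconf, y_conf, y_unconf, lam):
    """Linearly interpolate representations and targets with coefficient lam."""
    h_mix = lam * h_conf + (1.0 - lam) * h_unconf
    y_mix = lam * y_conf + (1.0 - lam) * y_unconf
    return h_mix, y_mix

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against soft (interpolated) targets, averaged over rows."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(soft_targets * log_probs).sum(axis=-1).mean())

# Toy usage: 2 checkpoints, 4 examples.
gold_probs = [[0.9, 0.2, 0.6, 0.1],
              [0.7, 0.4, 0.8, 0.3]]
conf = confidence_from_dynamics(gold_probs)
confident_idx, unconfident_idx = split_by_confidence(conf)

h_mix, y_mix = mixup_pair(
    h_conf=np.array([1.0, 0.0]), h_unconf=np.array([0.0, 1.0]),
    y_conf=np.array([1.0, 0.0]), y_unconf=np.array([0.0, 1.0]),
    lam=0.25,
)
```

In Mixup-style training the coefficient `lam` is commonly sampled per batch, for example with `np.random.default_rng().beta(a, a)` for some hyperparameter `a`, and the loss on interpolated examples is added to the ordinary next-token prediction loss with a weighting factor. Both choices are assumptions here, since this summary does not specify them.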

Paper Structure (37 sections, 7 equations, 4 figures, 12 tables)

Introduction
Related Work
LLM Instruction Tuning.
Data Characterization via Training Dynamics.
Mixup-Based Learning.
SFTMix
Preliminaries
The NTP Instruction-Tuning Paradigm.
LLM Confidence via Training Dynamics.
Motivation
Recipe
Step 1: Determine Subspaces with Distinct Confidence Levels.
Step 2: Linearly Interpolate Confident and Unconfident Examples.
Step 3: Incorporate a Mixup-Based Regularization.
Analysis
...and 22 more sections

Figures (4)

Figure 1: Embeddings of $2{,}500$ most and $2{,}500$ least confident examples in Alpaca-52K by Llama-3.1-8B trained using NTP. The clear separation between these embeddings suggests that the LLM exhibits varying confidence levels across different semantic regions.
Figure 2: The overall pipeline of the three-stage SFTMix recipe for LLM instruction tuning.
Figure 3: Confidence distributions from instruction-tuning Llama on datasets of varying qualities. On the y-axis, "High" represents higher-quality examples from GPT-4, while "Low" denotes lower-quality original examples from Alpaca-52K. Llama's confidence distributions show substantial overlap across these datasets.
Figure 4: A qualitative example from the extraction category in MT-Bench. Compared to its NTP-tuned counterpart, Llama instruction-tuned by SFTMix accurately interprets the queries from both turns and correctly extracts the relevant information from the prompt.
