MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following

Renze Lou; Kai Zhang; Jian Xie; Yuxuan Sun; Janice Ahn; Hanzi Xu; Yu Su; Wenpeng Yin

TL;DR

MUFFIN introduces Scaling Tasks per Input, a data-curation paradigm that diversifies the instructions attached to each input, rather than adding more (input, output) pairs per instruction or adding more input-free tasks. The pipeline combines input-facet brainstorming, instruction rematching of human-crafted instructions, and a classification-expansion step that balances task types, so that each input carries a varied set of instructions. Across four zero-shot benchmarks and multiple model scales, models trained on MUFFIN generally follow instructions better than models trained on Scaling-Inputs and Scaling Input-Free Tasks baselines, and their task solutions also fare well in human evaluation. The work offers a scalable data-curation approach for instruction tuning and highlights the role of per-input task diversification in zero-shot generalization.

Abstract

In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes: i) Scaling-Inputs: amplifying the (input, output) pairs per task instruction, aiming for better instruction adherence; ii) Scaling Input-Free Tasks: enlarging the set of tasks, each composed of an (instruction, output) pair with no separate input. However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation of, or non-compliance with, instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective at instruction following on instances of the Scaling-Inputs kind. This work introduces MUFFIN, a new scheme of instruction-following dataset curation. Specifically, we automatically Scale Tasks per Input by diversifying these tasks with various input facets. Experimental results across four zero-shot benchmarks, spanning both the Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs at various scales trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes.
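The three schemes contrasted in the abstract differ mainly in which field of a training instance is varied. The following is a minimal sketch of that difference, not the authors' data format: the `Instance` class, the helper functions, and all example texts are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical training instance; an empty input means the task is input-free."""
    instruction: str
    input: str
    output: str


# Scaling-Inputs: one instruction paired with many inputs.
scaling_inputs = [
    Instance("Classify the sentiment of the review.", "Great phone, love it.", "positive"),
    Instance("Classify the sentiment of the review.", "Battery died in a day.", "negative"),
]

# Scaling Input-Free Tasks: many (instruction, output) pairs, no separate input.
scaling_input_free = [
    Instance("Write a one-line greeting for a new teammate.", "", "Welcome aboard!"),
    Instance("Name the capital of France.", "", "Paris"),
]

# Scaling Tasks per Input: one input paired with many instructions,
# each instruction targeting a different facet of the same input.
review = "Battery died in a day."
scaling_tasks_per_input = [
    Instance("Classify the sentiment of the review.", review, "negative"),
    Instance("Which product component does the review mention?", review, "battery"),
    Instance("How many words does the review contain?", review, "5"),
]


def instructions_per_input(instances):
    """Count the distinct instructions attached to each input."""
    groups = defaultdict(set)
    for inst in instances:
        groups[inst.input].add(inst.instruction)
    return {inp: len(instrs) for inp, instrs in groups.items()}


def inputs_per_instruction(instances):
    """Count the distinct inputs attached to each instruction."""
    groups = defaultdict(set)
    for inst in instances:
        groups[inst.instruction].add(inst.input)
    return {instr: len(inps) for instr, inps in groups.items()}
```

Under this toy layout, `inputs_per_instruction` is the quantity that grows under Scaling-Inputs, while `instructions_per_input` is the quantity that Scaling Tasks per Input aims to grow.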

Paper Structure (51 sections, 10 figures, 18 tables)

Introduction
Related Work
Muffin Curation
Input collection
Instruction Collection
Classification Expansion
Data Analyses
Experimental Setup
Evaluation Benchmarks.
Experimental Results
Automatic Evaluation
Human Evaluation
Analyses
Ablation study to answer $\mathcal{Q}_1$.
Resolving the task leaking concern in $\mathcal{Q}_2$.
...and 36 more sections

Figures (10)

Figure 1: Three different paradigms for designing instruction-following datasets.
Figure 2: Data construction pipeline of Muffin.
Figure 3: Human evaluation on the data quality. Both valid and invalid instances can be found in Table \ref{tab:data_cases_with_validity}. A4 indicates the joint set of successful cases in A1, A2, and A3.
Figure 4: Comparison of scaling trends between Muffin and the previous baseline datasets (average performance across all four benchmarks).
Figure 5: The performance of the mixture of Muffin and SuperNI.
...and 5 more figures
