MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following

Renze Lou, Kai Zhang, Jian Xie, Yuxuan Sun, Janice Ahn, Hanzi Xu, Yu Su, Wenpeng Yin

TL;DR

MUFFIN introduces Scaling Tasks per Input, a data-curation paradigm that diversifies the instructions attached to each input, rather than adding more (input, output) pairs per instruction or adding more input-free tasks. The pipeline combines input facet brainstorming, instruction rematching of human-crafted instructions, and a classification-expansion mechanism that balances task types, yielding a varied set of instructions for each input. Across four zero-shot benchmarks and multiple model scales, models trained on MUFFIN generally show stronger instruction-following capabilities than models trained on Scaling-Inputs or Scaling Input-Free Tasks baselines, and their task solutions are also rated well in human evaluation. This work offers a scalable data-curation approach for instruction tuning and highlights the importance of per-input task diversification for robust zero-shot generalization.

Abstract

In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes: i) Scaling-Inputs: Amplifying (input, output) pairs per task instruction, aiming for better instruction adherence. ii) Scaling Input-Free Tasks: Enlarging tasks, each composed of an (instruction, output) pair (without requiring a separate input anymore). However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation or non-compliance with instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective in instruction following when dealing with instances in Scaling-Inputs. This work introduces MUFFIN, a new scheme of instruction-following dataset curation. Specifically, we automatically Scale Tasks per Input by diversifying these tasks with various input facets. Experimental results across four zero-shot benchmarks, spanning both Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes.
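
The three curation schemes contrasted in the abstract differ mainly in which field of a training example is held fixed and which is varied. The sketch below is only an illustration of that difference: the `Example` record, the helper `instructions_per_input`, and every instruction, input, and output string are hypothetical and are not taken from the MUFFIN dataset or its code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Example:
    """One training example (hypothetical schema, for illustration only)."""
    instruction: str
    input: Optional[str]  # None for input-free tasks
    output: str


# Scaling-Inputs: one instruction, many (input, output) pairs.
scaling_inputs = [
    Example("Classify the sentiment of the review.", "Great phone, love it.", "positive"),
    Example("Classify the sentiment of the review.", "Battery died in a day.", "negative"),
]

# Scaling Input-Free Tasks: many tasks, each an (instruction, output) pair.
input_free = [
    Example("Name the capital of France.", None, "Paris"),
    Example("Give an antonym of 'hot'.", None, "cold"),
]

# Scaling Tasks per Input: one input, many instructions about its facets.
tasks_per_input = [
    Example("Classify the sentiment of the review.", "Great phone, love it.", "positive"),
    Example("Which product is being reviewed?", "Great phone, love it.", "phone"),
    Example("Rewrite the review as a question.", "Great phone, love it.", "Is this a great phone?"),
]


def instructions_per_input(examples):
    """Map each distinct input to the set of instructions paired with it."""
    grouped = defaultdict(set)
    for ex in examples:
        if ex.input is not None:  # input-free examples have nothing to group by
            grouped[ex.input].add(ex.instruction)
    return dict(grouped)


for name, data in [
    ("Scaling-Inputs", scaling_inputs),
    ("Scaling Input-Free Tasks", input_free),
    ("Scaling Tasks per Input", tasks_per_input),
]:
    counts = {inp: len(instrs) for inp, instrs in instructions_per_input(data).items()}
    print(name, counts)
```

In this toy data, every input under Scaling-Inputs is paired with a single instruction, the input-free set has no inputs at all, and only the Scaling Tasks per Input set pairs one input with several distinct instructions.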

Paper Structure (51 sections, 10 figures, 18 tables)

Figures (10)

  • Figure 1: Three different paradigms for designing instruction-following datasets.
  • Figure 2: Data construction pipeline of Muffin.
  • Figure 3: Human evaluation on the data quality. Both valid and invalid instances can be found in Table \ref{tab:data_cases_with_validity}. A4 indicates the joint set of successful cases in A1, A2, and A3.
  • Figure 4: The scaling trends comparison between Muffin and the previous baseline datasets (average performances on all four benchmarks).
  • Figure 5: The performance of the mixture of Muffin and SuperNI.
  • ...and 5 more figures