LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms

Aditi Jha, Sam Havens, Jeremy Dohmann, Alex Trott, Jacob Portes

TL;DR

LIMIT investigates whether a small, diverse instruction-tuning dataset can perform well under both traditional benchmark evaluation and model-based evaluation. Using MPT-7B and MPT-30B finetuned on Instruct-v1, Instruct-v3, and LIMA, the study contrasts perplexity-based benchmarks from the MosaicML Eval Gauntlet with GPT-4-judged open-ended responses. Key finding: dataset composition, not size, largely determines performance. Instruct datasets excel on perplexity-based benchmarks, LIMA aligns better with open-ended evaluation, and a mixture of the two yields robust performance across both paradigms. This work provides practical guidance for efficient, reproducible instruction tuning under resource constraints.

Abstract

Large Language Models are traditionally finetuned on large instruction datasets. However, recent studies suggest that small, high-quality datasets can suffice for general purpose instruction following. This lack of consensus surrounding finetuning best practices is in part due to rapidly diverging approaches to LLM evaluation. In this study, we ask whether a small number of diverse finetuning samples can improve performance on both traditional perplexity-based NLP benchmarks and open-ended, model-based evaluation. We finetune open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes ranging from 1k to 60k samples. We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation. Finally, we show that mixing textbook-style and open-ended QA finetuning datasets optimizes performance on both evaluation paradigms.
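The dataset mixing described in the abstract can be illustrated with a minimal sketch. It assumes each dataset is already loaded as a list of prompt/response records; the function name `mix_datasets`, the record fields, and the placeholder data are hypothetical and are not taken from the paper's code. The sizes (a 1,000-sample random subset of the 56,200-sample Instruct-v3 plus the 1,000 LIMA samples) follow the 30B configuration described in the figure captions.

```python
import random


def mix_datasets(instruct, lima, n_instruct, seed=0):
    """Combine a random subset of Instruct samples with all LIMA samples.

    Hypothetical helper for illustration only; the paper's actual data
    pipeline is not shown in this document.
    """
    rng = random.Random(seed)                  # fixed seed for reproducibility
    subset = rng.sample(instruct, n_instruct)  # sample without replacement
    combined = subset + list(lima)
    rng.shuffle(combined)                      # interleave the two sources
    return combined


# Placeholder records standing in for the real datasets (assumption).
instruct = [
    {"source": "instruct", "prompt": f"q{i}", "response": f"a{i}"}
    for i in range(56200)
]
lima = [
    {"source": "lima", "prompt": f"q{i}", "response": f"a{i}"}
    for i in range(1000)
]

combined = mix_datasets(instruct, lima, n_instruct=1000)
print(len(combined))  # 2000
```

Fixing the seed keeps the sampled subset stable across runs, so results on the two evaluation paradigms can be compared against the same training mixture.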

Paper Structure (29 sections, 12 figures, 6 tables)

Figures (12)

  • Figure 1: How to finetune and evaluate LLMs for general purpose instruction following? (A) We finetune open-source LLMs MPT-7B and MPT-30B on datasets of varying sizes: Instruct-v1 and Instruct-v3, which contain 56.2k-59.3k instruction samples, and the LIMA dataset, which contains 1,000 samples. (B) We then evaluate finetuned models using two paradigms: (1) traditional NLP perplexity-based evaluation on benchmarks such as MMLU and BIG-bench, and (2) model-based evaluation (via GPT-4) on open-ended generation.
  • Figure 2: Instruction finetuning training and test examples from (A) the Instruct-v1 (derived from Dolly-15k, HH-RLHF) and Instruct-v3 (derived from 9 diverse sources) training sets; (B) the LIMA training set, which contains open-ended questions and multi-paragraph answers; (C) the LIMA test set, which similarly contains open-ended questions; and (D) the MosaicML Eval Gauntlet test set, which contains trivia-like multiple choice questions.
  • Figure 3: Models finetuned on the Instruct datasets do better on traditional NLP benchmarks. Each plot shows the accuracy (between 0 and 1) of models on a given category of the MosaicML Eval Gauntlet, and the average score across all categories is shown in the first subplot. The two model sizes (7B and 30B) are grouped into two bar graphs. We show results for the base models MPT-7B and MPT-30B (cyan), and for the base models finetuned on the LIMA dataset (midnight blue), subsets of the Instruct dataset (khaki), and the full Instruct dataset (vermilion).
  • Figure 4: Model-based evaluation on the LIMA test set prefers models finetuned on the LIMA training set. We use GPT-4 as the judge to perform model-based evaluation on the LIMA test set (300 samples). We show the preference rate of MPT models finetuned on a subset of Instruct and on the full Instruct datasets when compared to LIMA-finetuned MPT models. (Left) GPT-4 prefers responses from MPT-7B finetuned on 1,000 LIMA samples over responses from MPT-7B finetuned on a random subset of 5,000 samples from Instruct-v1. (Right) GPT-4 strongly prefers responses from MPT-30B finetuned on LIMA samples over responses from MPT-30B finetuned on (1) a random subset of 1,000 samples from Instruct-v3, and (2) the full 56,200 samples in Instruct-v3.
  • Figure 5: Models finetuned on the LIMA training set and a subset of the Instruct training set perform well across both evaluation paradigms. (A) Accuracy of finetuned models on each category of the MosaicML Eval Gauntlet, along with their average scores. MPT-7B and MPT-30B, when finetuned on a subset of the Instruct datasets (5k samples from Instruct-v1 for 7B, 1k samples from Instruct-v3 for 30B) combined with the LIMA dataset, perform very close to MPT-7B and MPT-30B finetuned on all of Instruct, respectively. (B) Model-based evaluation on the LIMA test set using GPT-4. (Top) MPT-7B finetuned on the combined dataset is preferred over MPT-7B finetuned with LIMA alone by a huge margin. (Bottom) MPT-30B finetuned on the combined dataset is preferred $46.7\%$ of the time over MPT-30B finetuned on LIMA. In both cases, the preference rate of models finetuned on the combined dataset is higher than that of models finetuned on all of the Instruct datasets.
  • ...and 7 more figures
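
The preference rates in Figures 4 and 5 can be read as the fraction of test prompts on which the judge picks a given model's response. The sketch below shows that arithmetic under stated assumptions: the verdict labels `"A"`, `"B"`, and `"tie"`, the function name `preference_rates`, and the ten example verdicts are all hypothetical and made up for illustration. They are not the paper's judging format or results.

```python
from collections import Counter


def preference_rates(verdicts):
    """Return the fraction of verdicts for each label ("A", "B", "tie").

    Hypothetical helper: assumes one judge verdict per test prompt.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    total = len(verdicts)
    return {label: counts[label] / total for label in ("A", "B", "tie")}


# Made-up verdicts for ten prompts (illustration only, not real data).
verdicts = ["A"] * 6 + ["B"] * 3 + ["tie"] * 1
rates = preference_rates(verdicts)
print(rates)  # {'A': 0.6, 'B': 0.3, 'tie': 0.1}
```

With a tie label present, the two models' rates need not sum to 1, which is one reason a preference rate below 50% does not by itself mean the other model was preferred more often.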