Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning

Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

TL;DR

The paper shows that a very simple baseline, fine-tuning on the 1,000 examples with the longest responses from standard datasets (Alpaca-52k, Evol-Instruct-70k), can outperform sophisticated data-selection methods such as LIMA and AlpaGasus across several LLMs (Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1). Lightly refining these long examples via introspection prompting, optionally combined with NEFTune, yields further gains: competitive results on MT-Bench and AlpacaEval 2.0 while remaining competitive on the Open LLM benchmarks, all with only 1,000 training examples. The evaluation uses GPT-4 and PaLM-2 as judges and includes human preferences, to check that the improvements are not simply an effect of the judges' preference for longer responses. The authors argue that fine-tuning on the longest responses should be the default baseline for instruction fine-tuning, since it is cheap and holds up across the architectures and datasets tested.

Abstract

There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what are they? LIMA (NeurIPS 2023) and AlpaGasus (ICLR 2024) are state-of-the-art methods for selecting such high-quality examples, either via manual curation or using GPT-3.5-Turbo as a quality scorer. We show that the extremely simple baseline of selecting the 1,000 instructions with longest responses -- that intuitively contain more learnable information and are harder to overfit -- from standard datasets can consistently outperform these sophisticated methods according to GPT-4 and PaLM-2 as judges, while remaining competitive on the Open LLM benchmarks that test factual knowledge. We demonstrate this for several LLMs (Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1) and datasets (Alpaca-52k, Evol-Instruct-70k). In addition, a lightweight refinement of such long instructions can further improve the abilities of the fine-tuned LLMs, and allows us to obtain competitive results on MT-Bench and the 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0, while training on only 1,000 examples and no extra preference data. We also conduct a thorough analysis of our models to ensure that their enhanced performance is not simply due to GPT-4's preference for longer responses. Overall, our findings suggest that fine-tuning on the longest responses should be the default baseline for any work on instruction fine-tuning. We provide our code at https://github.com/tml-epfl/long-is-more-for-alignment.
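
The selection rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (see their repository for that). It assumes Alpaca-style records with an `output` field and uses whitespace-separated word count as a stand-in for response length; the paper's exact length measure (for example, tokenizer-based) may differ. The function name `select_longest` and its parameters are hypothetical.

```python
from typing import Dict, List


def select_longest(
    examples: List[Dict[str, str]],
    k: int = 1000,
    response_key: str = "output",  # assumption: Alpaca-style field name
) -> List[Dict[str, str]]:
    """Return the k examples whose responses are longest.

    Length is approximated by whitespace-separated word count, which is
    an assumption of this sketch rather than the paper's definition.
    """

    def response_length(example: Dict[str, str]) -> int:
        return len(example[response_key].split())

    # sorted() is stable, so examples with equal length keep dataset order.
    ranked = sorted(examples, key=response_length, reverse=True)
    return ranked[:k]


# Toy usage with three hand-written records (not real dataset entries).
toy = [
    {"instruction": "a", "output": "one"},
    {"instruction": "b", "output": "one two three"},
    {"instruction": "c", "output": "one two"},
]
top_two = select_longest(toy, k=2)
```

Because the rule depends only on response length, it needs no quality scorer or manual curation, which is what makes it cheap to run as a baseline.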

Paper Structure (38 sections, 27 figures, 8 tables)

Figures (27)

  • Figure 1: Selecting the longest responses leads to a strong IFT dataset. We fine-tune Llama-2-7B models on Alpaca-52k (Taori et al., 2023), AlpaGasus-1k (Chen et al., 2023), LIMA-1k (Zhou et al., 2023), and our Alpaca-1k-longest datasets. (a) Alpaca-1k-longest beats the three baselines in instruction-following performance according to both GPT-4 and PaLM-2 as judges. (b) Alpaca-1k-longest leads to a higher average response length at test time than Alpaca-52k and AlpaGasus-1k, but one similar to LIMA-1k, so its higher win rate cannot be attributed solely to the model having learnt to generate long responses.
  • Figure 2: Detailed preference evaluation (in %). For each pair of LLMs we report the win rate on 5 datasets (LIMA, Vicuna, Koala, WizardLM, Self-Instruct) according to GPT-4-as-a-judge. Top: we compare fine-tuning on Alpaca-1k-longest (AP-1k-L) to Alpaca-52k, AlpaGasus-1k, and LIMA-1k. Bottom: we compare fine-tuning on Evol-Instruct-1k-longest (EI-1k-L) to Evol-Instruct-70k, Evol-Instruct-AlpaGasus-1k (i.e. using the method of Chen et al. (2023) to subsample Evol-Instruct-70k), and LIMA-1k. Our datasets of long responses consistently lead to higher win rates than the existing methods.
  • Figure 3: The template of introspection prompting used to refine the responses in terms of style, structure, and level of detail.
  • Figure 4: Refinement via introspection improves instruction-following performance across architectures. We report the average preference performance (%) across five evaluation sets using GPT-4 as a judge. We show the win rate of models with different architectures fine-tuned on Alpaca-1k-longest against Alpaca-52k, AlpaGasus-1k, and LIMA-1k in blue (+ symbol). Additionally, we illustrate the improvement brought by our Refined-Alpaca-1k-longest over LIMA-1k, the strongest baseline, in red (* symbol).
  • Figure 5: Open LLM Leaderboard tasks with Llama-2-7B fine-tuned on Alpaca-based datasets and LIMA. The model fine-tuned on Alpaca-1k-longest achieves average performance comparable to that of AlpaGasus-1k, showing that the gain in instruction-following capability does not compromise factuality. Our Refined-Alpaca-1k-longest, with and without NEFTune, achieves the best results, surpassing LIMA-1k on all datasets.
  • ...and 22 more figures