Is In-Context Learning Sufficient for Instruction Following in LLMs?

Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

TL;DR

The paper rigorously evaluates whether in-context learning alone can achieve instruction following in large language models, focusing on URIAL prompts and comparing them to instruction fine-tuning on MT-Bench. It demonstrates that decoding configurations and the quality of demonstrations are critical for ICL effectiveness, and shows that high-quality, carefully selected in-context demonstrations can close part of the gap to IFT, though not fully for multi-turn interactions. A systematic comparison reveals that ICL and IFT are nearly equivalent for single-turn tasks in the low-data regime, while IFT generalizes better to multi-turn conversations. The work provides actionable insights into when ICL is viable versus when fine-tuning remains superior, and releases code to facilitate replication and further exploration.

Abstract

In-context learning (ICL) allows LLMs to learn from examples without changing their weights: this is a particularly promising capability for long-context LLMs that can potentially learn from many examples. Recently, Lin et al. (2024) proposed URIAL, a method using only three in-context examples to align base LLMs, achieving non-trivial instruction following performance. In this work, we show that, while effective, ICL alignment with URIAL still underperforms compared to instruction fine-tuning on the established benchmark MT-Bench, especially with more capable base LLMs. We then uncover the most relevant elements for successful in-context alignment, finding the crucial role of the decoding parameters. Based on these insights, we show that the approach of URIAL can indeed be improved by adding high-quality, potentially carefully selected via greedy search, demonstrations in context, getting closer to the performance of instruct models. Finally, we provide the first, to our knowledge, systematic comparison of ICL and instruction fine-tuning (IFT) for instruction following in the low data regime, where ICL can be a viable alternative to IFT. Overall, our work advances the understanding of ICL as an alignment technique and its relationship to IFT. We provide our code at https://github.com/tml-epfl/icl-alignment.
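
The greedy search over demonstrations mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that): the function name `greedy_select`, the `score` callback, and the toy quality values are hypothetical stand-ins. In the paper's setting, the score would come from evaluating the prompted model on a benchmark such as MT-Bench.

```python
from typing import Callable, List, Sequence


def greedy_select(
    base: Sequence[str],
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    num_to_add: int,
) -> List[str]:
    """Greedily extend `base` with the candidates that most improve `score`.

    `score` is a hypothetical callback that maps a list of demonstrations
    to a scalar quality estimate (higher is better).
    """
    selected = list(base)
    remaining = list(candidates)
    for _ in range(num_to_add):
        if not remaining:
            break
        # Try each remaining candidate as the next demonstration and keep the best.
        best = max(remaining, key=lambda c: score(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy usage: an additive score with made-up per-demonstration values.
toy_quality = {"a": 0.2, "b": 0.5, "c": 0.1}


def toy_score(demos: List[str]) -> float:
    return sum(toy_quality.get(d, 0.0) for d in demos)


chosen = greedy_select(["u1", "u2", "u3"], ["a", "b", "c"], toy_score, num_to_add=2)
```

Each greedy step costs one scoring call per remaining candidate, so with a real benchmark as the score the search becomes expensive quickly. The additive toy score also hides the diminishing returns that a real evaluation may show.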

Paper Structure

This paper contains 31 sections, 14 figures, and 11 tables.

Figures (14)

  • Figure 1: Effect of decoding parameters on the 1st-turn MT-Bench scores across models. We vary temperature and top-$p$, with a fixed repetition penalty of 1.15 as used in the Urial codebase. The heatmaps show that the answering quality of the base model (Mistral-7B-v0.2) with and without Urial is sensitive to the decoding schemes. Conversely, the performance of the instruct model is robust to varying the decoding parameters. Surprisingly, with proper decoding parameters, the base model alone is already capable of following instructions. See the complete results in App. \ref{sec:more_decoding_scheme_exps}.
  • Figure 2: Influence of the individual components of Urial. We report the 1st- and 2nd-turn MT-Bench scores for every subset of the three demonstrations of Urial with Mistral-7B-v0.2. We test each configuration with ("Rule + Examples") and without ("Examples-Only") the Urial set of rules in the in-context prompt. We observe a clear increasing trend in the 1st-turn score with more examples, but a decrease in the 2nd-turn performance. The set of rules does not seem to influence the results. The randomness for 0 and 3 examples is caused by small fluctuations in the score of the GPT-4 judge.
  • Figure 3: Scaling the number of demonstrations for alignment with ICL on Mistral-7B-v0.2 and Llama-3.1-8B. We measure the alignment performance of different settings using the MT-Bench score. ICL with more demonstrations quickly saturates and does not bridge the performance gap between the base model and its aligned counterpart. In particular, the ICL alignment performance of 3 random examples from the high-quality SkillMix dataset surpasses that of 3 examples from Urial.
  • Figure 4: The distribution of the 1st-turn MT-Bench score (GPT-4-Turbo as judge) on Llama-3.1-8B obtained by adding multiple instructions from SkillMix (Kaur et al., 2024) as a 4th (a), 5th (b), or 6th (c) demonstration to Urial. The dashed lines of various colors mark the 1st-turn MT-Bench scores of the demonstrations selected by the search. A majority of the 4th examples contribute positively to the model's instruction-following performance, but the improvement quickly diminishes when running the greedy search for the 5th and 6th demonstrations.
  • Figure 5: Comparison of ICL vs IFT for alignment in the low data regime. We measure the alignment performance of different settings for Mistral-7B-v0.2 and Llama-3.1-8B using the MT-Bench score. IFT with more demonstrations keeps improving the alignment performance, almost bridging the gap between the base model and its aligned counterpart. IFT-aligned models perform well on multi-turn conversations, unlike with ICL. Finally, data quality has a significant impact on both IFT and ICL: the higher-quality SkillMix leads to better performance than Evol-Instruct.
  • ...and 9 more figures
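
The decoding sweep described in the Figure 1 caption (varying temperature and top-$p$ with the repetition penalty fixed at 1.15) can be sketched as a simple configuration grid. Only the value 1.15 comes from the caption. The helper `decoding_grid` and the grid values below are illustrative assumptions, not the values used in the paper. The dictionary keys follow common sampling-parameter names, and treating temperature 0 as greedy decoding is likewise an assumption of this sketch.

```python
from itertools import product
from typing import Dict, List, Sequence

REPETITION_PENALTY = 1.15  # fixed value stated in the Figure 1 caption


def decoding_grid(
    temperatures: Sequence[float],
    top_ps: Sequence[float],
    repetition_penalty: float = REPETITION_PENALTY,
) -> List[Dict[str, object]]:
    """Build one decoding configuration per (temperature, top_p) pair."""
    configs = []
    for temperature, top_p in product(temperatures, top_ps):
        configs.append(
            {
                "temperature": temperature,
                "top_p": top_p,
                "repetition_penalty": repetition_penalty,
                # Assumption: temperature 0 means greedy decoding, so sampling is off.
                "do_sample": temperature > 0,
            }
        )
    return configs


# Illustrative grid values (hypothetical, not taken from the paper).
grid = decoding_grid(temperatures=[0.0, 0.5, 1.0], top_ps=[0.5, 0.9, 1.0])
```

Each entry in `grid` would then be passed to whatever generation routine is used, with one benchmark run per configuration producing one heatmap cell.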