Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning

Po-Nien Kung, Nanyun Peng

TL;DR

This work questions whether instruction-tuned models genuinely learn to follow instructions or simply exploit superficial cues. By systematically ablating semantic content in task definitions and task examples, and comparing to a random-output baseline, the authors show that simplified or misleading instructions can yield performance on par with original IT in low-resource settings. The findings imply that IT gains may largely reflect learning the output format and space rather than true instruction comprehension, underscoring the need for robust evaluation benchmarks and methods. The results call for more reliable IT paradigms and careful baselining to avoid overestimating instruction-following capabilities.

Abstract

Recent works on instruction tuning (IT) have achieved great performance with zero-shot generalizability to unseen tasks. With additional context (e.g., task definitions, examples) provided to models for fine-tuning, they achieve much higher performance than untuned models. Despite impressive performance gains, what models learn from IT remains understudied. In this work, we analyze how models utilize instructions during IT by comparing models trained with altered vs. original instructions. Specifically, we create simplified task definitions by removing all semantic components and leaving only the output space information, and delusive examples that contain incorrect input-output mappings. Our experiments show that models trained on simplified task definitions or delusive examples can achieve performance comparable to that of models trained on the original instructions and examples. Furthermore, we introduce a random baseline to perform zero-shot classification tasks, and find that it achieves similar performance (42.6% exact-match) to IT (43% exact-match) in the low-resource setting, while both methods significantly outperform naive T5 (30% exact-match). Our analysis provides evidence that the impressive performance gain of current IT models can come from picking up superficial patterns, such as learning the output format and guessing. Our study highlights the urgent need for more reliable IT methods and evaluation.
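
To make the two ablations in the abstract concrete, here is a minimal Python sketch of how a simplified task definition and delusive examples could be built for a classification task. It is an illustration under stated assumptions, not the authors' implementation: the function names, the "Output space: ..." template, and the toy NLI data are all hypothetical, and the paper's own construction may differ.

```python
import random

def simplify_definition(labels):
    # Hypothetical template: drop all semantic content from the task
    # definition and keep only the output space (the label set).
    return "Output space: " + ", ".join(labels)

def make_delusive(examples, labels, rng):
    # Keep each example's input but swap its output for a different,
    # incorrect label. Assumes the task has at least two labels.
    delusive = []
    for text, gold in examples:
        wrong_labels = [label for label in labels if label != gold]
        delusive.append((text, rng.choice(wrong_labels)))
    return delusive

# Toy data, invented for illustration only.
labels = ["entailment", "neutral", "contradiction"]
examples = [
    ("A man sleeps. / A man is awake.", "contradiction"),
    ("A dog runs. / An animal moves.", "entailment"),
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
simplified = simplify_definition(labels)
delusive = make_delusive(examples, labels, rng)

print(simplified)
for text, wrong in delusive:
    print(text, "->", wrong)
```

In this sketch the inputs stay untouched and only the input-output mapping is corrupted, which matches the abstract's description of delusive examples as containing an incorrect mapping.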

Paper Structure

This paper contains 29 sections, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: The left sub-figure demonstrates a two-stage pipeline in which the model first trains on a set of tasks and is then evaluated on other, unseen tasks. The model takes the task definition, examples, and instance input together to make a prediction. The two right sub-figures show how we create the Simplified task definition and Delusive task examples for ablation studies. We also show the results at the bottom, alongside those of T5 w/o IT (untuned models). Models trained on Simplified task definitions and Delusive examples still achieve a significant performance gain over T5 w/o IT.
  • Figure 2: Controlled experiments for task definitions. Original, Simplified, and Empty indicate which type of task definition the model is trained and tested with. T5 w/o IT (12.5) shows the score (12.5) of the baseline T5-large model. The top two sub-figures show the main results, evaluating classification tasks using Exact-Match (accuracy) and generative tasks using Rouge-L. The bottom two sub-figures are supplementary results evaluating Rouge-L for all tasks and for classification tasks.
  • Figure 3: Controlled experiments for task examples. The left sub-figure shows the main results, where Original task examples are used for testing (in-context learning). Original, Delusive, and Empty indicate which type of task examples is used for training, and T5 w/o IT is the baseline T5-large model. The right sub-figure shows supplementary results using Delusive examples for testing. The faint dashed lines are copied from the left sub-figure for comparison.
  • Figure 4: Results for the Random Guessing baseline, which randomly guesses an answer from the output space (labels). The left figure shows format correctness, i.e., the fraction of model predictions that lie within the label space for classification (CLS) tasks. The right figure shows the average exact-match score on CLS tasks.
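
The two quantities in the Figure 4 caption can be sketched in a few lines of Python. This is a hedged illustration that follows only the caption's wording: the function names and the toy label set are assumptions, and the paper's actual evaluation code (e.g., any text normalization before matching) may differ.

```python
import random

def random_guess(labels, rng):
    # Random Guessing baseline: pick any label from the output space.
    return rng.choice(labels)

def format_correctness(predictions, labels):
    # Fraction of predictions that fall inside the label space,
    # regardless of whether they are the correct label.
    label_set = set(labels)
    return sum(p in label_set for p in predictions) / len(predictions)

def exact_match(predictions, golds):
    # Fraction of predictions that equal the gold label exactly.
    return sum(p == g for p, g in zip(predictions, golds)) / len(predictions)

# Toy classification task, invented for illustration only.
labels = ["positive", "negative"]
golds = ["positive", "negative", "negative", "positive"]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
guesses = [random_guess(labels, rng) for _ in golds]

print("format correctness:", format_correctness(guesses, labels))
print("exact match:", exact_match(guesses, golds))
```

By construction, the random guesser always scores 1.0 on format correctness, while its exact-match score depends only on chance and the label distribution. This is what makes it a useful reference point for separating "learned the output format" from "understood the instruction".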