Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning

Fan Yin, Jesse Vig, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Jason Wu

TL;DR

This work examines which parts of human-written task definitions actually drive instruction learning in large language models. Using ablations guided by human annotations and a syntax-guided compression method, it shows that label information matters most, that many other details can be discarded, and that removing about 60% of the tokens can maintain or even improve performance. It then proposes two practical strategies: presenting definitions as JSON-like triplets, and adding a meta-tuning stage that helps the model better understand the definitions. Together these yield a 4.2 ROUGE-L improvement over 119 unseen test tasks, with gains especially for smaller models. The findings suggest how to write concise, robust, and shareable task definitions that improve zero-shot generalization and efficiency on benchmarks such as NIv2.

Abstract

Large language models (LLMs) have shown impressive performance in following natural language instructions to solve unseen tasks. However, it remains unclear whether models truly understand task definitions and whether the human-written definitions are optimal. In this paper, we systematically study the role of task definitions in instruction learning. We first conduct an ablation analysis informed by human annotations to understand which parts of a task definition are most important, and find that model performance only drops substantially when removing contents describing the task output, in particular label information. Next, we propose an automatic algorithm to compress task definitions to a minimal supporting set of tokens, and find that 60% of tokens can be removed while maintaining or even improving model performance. Based on these results, we propose two strategies to help models better leverage task instructions: (1) providing only key information for tasks in a common structured format, and (2) adding a meta-tuning stage to help the model better understand the definitions. With these two strategies, we achieve a 4.2 ROUGE-L improvement over 119 unseen test tasks.
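
To make the idea of compressing a definition to a minimal supporting set of tokens concrete, here is a minimal sketch. It is not the paper's algorithm, which is syntax-guided; this version simply tries to drop one token at a time and keeps the removal whenever a score does not fall. The names `compress_definition` and `toy_score`, the `tolerance` parameter, and the example definition are all hypothetical. In practice the score would be model performance on held-out examples of the task; here it is replaced by a toy stand-in that only checks whether the label words survive.

```python
from typing import Callable, List


def compress_definition(
    tokens: List[str],
    score: Callable[[List[str]], float],
    tolerance: float = 0.0,
) -> List[str]:
    """Greedily drop tokens whose removal does not lower the score.

    Token-level simplification for illustration only; the paper's
    method operates on syntactic structure, not single tokens.
    """
    kept = list(tokens)
    baseline = score(kept)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if score(candidate) >= baseline - tolerance:
            kept = candidate  # token was not needed; re-check the same index
        else:
            i += 1  # token is needed; move to the next one
    return kept


def toy_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for model performance:
    the fraction of label words still present in the definition."""
    labels = {"positive", "negative"}
    return len(labels & set(tokens)) / len(labels)


# Hypothetical task definition, split on whitespace.
definition = (
    "Classify the review as positive or negative "
    "based on its overall sentiment"
).split()

compressed = compress_definition(definition, toy_score)
ratio = 1 - len(compressed) / len(definition)
print(compressed, f"removed {ratio:.0%} of tokens")
```

With this toy score only the label words are retained, which mirrors the qualitative finding that label information is what the model depends on most. A real scoring function would be far more expensive to call, which is one reason to remove larger units than single tokens.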

Paper Structure (43 sections, 4 figures, 9 tables, 1 algorithm)

Figures (4)

  • Figure 1: Annotations of three examples that cover the eight categories of content in task definitions.
  • Figure 2: The compression ratio for each task category. Models tend to need less definition information for generation tasks compared to classification.
  • Figure 3: The number of each content category in original and compressed definitions. We put the numerical value of the fraction of kept content on top of each bar.
  • Figure 4: Example compressions of task definitions, with retained content highlighted in green.