From Symbolic Tasks to Code Generation: Diversification Yields Better Task Performers

Dylan Zhang, Justin Wang, Francois Charton

TL;DR

The paper shows that instruction diversity, including semantic variety and cross-domain coverage, is a primary driver of generalization in instruction-tuned LLMs. Using a controlled Markov-algorithm-based string-rewrite setup, it demonstrates phase transitions in generalization with respect to the number of distinct instructions and the distribution of examples, and extends these insights to real-world code-generation tasks where broader data mixtures yield performance gains. However, excessive diversification across domains can dilute domain-specific strengths, revealing a trade-off that depends on mix ratios and task domain. Overall, diversifying the semantic space of instruction-tuning data significantly enhances an LLM's ability to follow instructions and perform unseen tasks, with practical implications for constructing instruction-following datasets across domains.

Abstract

Instruction tuning -- tuning large language models on instruction-output pairs -- is a promising technique for making models better adapted to the real world. Yet, the key factors driving the model's capability to understand and follow instructions not seen during training remain under-explored. Our investigation begins with a series of synthetic experiments within the theoretical framework of a Turing-complete algorithm called the Markov algorithm, which allows fine-grained control over the instruction-tuning data. Generalization and robustness with respect to the training distribution emerge once a diverse enough set of tasks is provided, even though very few examples are provided for each task. We extend these initial results to a real-world application scenario of code generation and find that a more diverse instruction set, extending beyond code-related tasks, improves the performance of code generation. Our observations suggest that a more diverse semantic space for instruction-tuning sets greatly improves the model's ability to follow instructions and perform tasks.
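
For readers unfamiliar with Markov algorithms, the following is a minimal, generic sketch of the string-rewriting procedure and not the paper's actual experimental code. It assumes the textbook definition: an ordered list of rewrite rules is scanned from the top, the first rule whose pattern occurs in the string is applied to the leftmost occurrence, and the process repeats until no rule applies or a terminating rule fires. The names `Rule`, `run_markov`, and `max_steps` are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple

# Hypothetical rule representation: (pattern, replacement, is_terminal).
Rule = Tuple[str, str, bool]


def run_markov(rules: List[Rule], text: str, max_steps: int = 1000) -> str:
    """Apply an ordered list of rewrite rules until none applies."""
    for _ in range(max_steps):
        for pattern, replacement, is_terminal in rules:
            idx = text.find(pattern)  # leftmost occurrence, or -1 if absent
            if idx != -1:
                # Rewrite only the leftmost occurrence of the pattern.
                text = text[:idx] + replacement + text[idx + len(pattern):]
                if is_terminal:
                    return text
                break  # restart the scan from the first rule
        else:
            # No rule matched: the algorithm halts.
            return text
    raise RuntimeError("step limit reached without halting")


# A single rule that repeatedly swaps "ba" into "ab" sorts the string.
print(run_markov([("ba", "ab", False)], "bbaab"))
```

In a sketch like this, each distinct rule set plays the role of one "instruction", and an input string paired with its rewritten output is one example, which is what makes the number of instructions and the number of examples per instruction separately controllable.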

Paper Structure (24 sections, 2 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Illustration of our symbolic tasks in this paper.
  • Figure 2: Generalization versus the number of instructions during training.
  • Figure 3: Generalization versus the number of instructions during training.
  • Figure 4: Model's performance on $k<3$ when trained on the three classes of restricted semantics as in \ref{sec:diversity_semantics}. Models trained on 500 or fewer instructions never generalize to smaller $k$.
  • Figure 5: Generalization versus the number of instructions during training.
  • ...and 2 more figures