Instruction Tuning with Human Curriculum

Bruce W. Lee, Hyunsoo Cho, Kang Min Yoo

TL;DR

The paper presents Corgi, a curriculum-inspired instruction-tuning framework that uses a synthetic, curriculum-rich dataset to train LLMs. By interleaving learning across subjects and progressing through Bloom’s taxonomy, the approach achieves substantial, data-efficient gains across nine benchmarks without extra compute. Key contributions include a three-step dataset construction process (concept extraction, synthetic instruction generation, and quality filtering) and a global-interleaving training regimen that outperforms blocking and unstructured curricula. The work demonstrates robust improvements on knowledge, reasoning, and language tasks, while also discussing limitations related to difficulty annotation and model-scale generalization.

Abstract

In this work, we (1) introduce Curriculum Instruction Tuning, (2) explore the potential advantages of employing diverse curriculum strategies, and (3) delineate a synthetic instruction-response generation framework that complements our theoretical approach. Distinct from the existing instruction tuning dataset, our generation pipeline is systematically structured to emulate the sequential and orderly characteristic of human learning. Additionally, we describe a methodology for generating instruction-response datasets that extensively span the various stages of human education, from middle school through the graduate level, utilizing educational subject catalogs. Before training, we meticulously organize the instruction data to ensure that questions escalate in difficulty regarding (A) the subject matter and (B) the intricacy of the instructions. The findings of our study reveal that substantial improvements in performance can be achieved through the mere application of curriculum ordering to instruction data (achieving gains of +4.76 on TruthfulQA, +2.98 on MMLU, +2.8 on OpenbookQA, and +1.28 on ARC-hard) compared to random shuffling. This enhancement is achieved without incurring additional computational expenses. Through comprehensive experimentation, we observe that the advantages of our proposed method are consistently evident across nine benchmarks.
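As a rough illustration of the comparison described in the abstract, the sketch below orders the same instruction examples either by curriculum or by random shuffling. The `Example` fields, the integer difficulty scales, and the simple two-key sort are assumptions made for illustration and are not the authors' implementation; the point is only that both orderings contain the same examples, so the training cost is unchanged.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    instruction: str
    subject_level: int      # hypothetical scale: 0 = middle school, higher = closer to graduate level
    instruction_level: int  # hypothetical scale: higher = more intricate instruction


def curriculum_order(examples):
    """Sort so that subject difficulty rises first, then instruction intricacy."""
    return sorted(examples, key=lambda e: (e.subject_level, e.instruction_level))


def random_order(examples, seed=0):
    """Baseline: the same examples in a seeded random order."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Either function returns a list that can be passed to an unshuffled data loader, so the two training runs differ only in the order in which the examples are presented.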

Paper Structure (28 sections, 9 figures, 6 tables)

This paper contains 28 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Overview of our educational framework. We create a dataset based on a continuum from secondary school to grad school, extracting multiple concepts from each course. For every concept, we formulate 19 questions of varied cognitive levels using Bloom's taxonomy.
  • Figure 2: Overview of our proposed curriculum dataset construction steps, which preserve the progressive metadata of concept difficulty and instruction-format difficulty. These characteristics allow the application of pedagogically motivated curriculum learning strategies, which we discuss further in the sections on curriculum and curriculum analysis.
  • Figure 3: A comparison of two training sequences. Small blocks (e.g., H1, M1) stand for fine-grained concepts per subject. Blocking naively stacks hierarchical blocks per subject, while interleaving cyclically revisits each subject, adhering to the cognitive hierarchy from Bloom's taxonomy.
  • Figure 4: (Continued from Figure 2) More examples of local progressions. A comparison of clustering and spiral training sequences. The clustering stacks hierarchical blocks for each concept, while the spiral cyclically revisits each concept and alternates cognitive difficulty from Bloom's taxonomy.
  • Figure 5: Local curriculum diminishes performance improvement. The figure shows a macroscopic, averaged comparison of benchmark improvements relative to the base model (LLaMA 2 13B). World Knowledge: MMLU, TruthfulQA, TriviaQA; Commonsense Reasoning: OpenBookQA, ARC, PIQA, CommonsenseQA; Language Understanding: HellaSwag and Lambada. A full breakdown of this chart is given in Appendix H.
  • ...and 4 more figures
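
The two training sequences compared in Figure 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: each item is a hypothetical `(subject, level)` pair, where `level` is an integer rank standing in for a cognitive level from Bloom's taxonomy, and `subjects` is a caller-supplied subject order.

```python
def blocking(items, subjects):
    """Finish every level of one subject before moving on to the next subject."""
    rank = {s: i for i, s in enumerate(subjects)}
    return sorted(items, key=lambda it: (rank[it[0]], it[1]))


def interleaving(items, subjects):
    """Cycle through all subjects at one level before raising the level."""
    rank = {s: i for i, s in enumerate(subjects)}
    return sorted(items, key=lambda it: (it[1], rank[it[0]]))


subjects = ["history", "math"]  # hypothetical subjects
items = [(s, level) for s in subjects for level in range(3)]

print(blocking(items, subjects))
# [('history', 0), ('history', 1), ('history', 2), ('math', 0), ('math', 1), ('math', 2)]
print(interleaving(items, subjects))
# [('history', 0), ('math', 0), ('history', 1), ('math', 1), ('history', 2), ('math', 2)]
```

The only difference between the two orderings is which key is primary: blocking sorts by subject first, while interleaving sorts by level first and so revisits each subject once per level.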