DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

Ruiyao Xu, Noelle I. Samia, Han Liu

Abstract

Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.
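The second and third stages described above (pairing keywords with Bloom's Taxonomy levels, then validating by self-consistency) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of the six revised Bloom levels, the string normalization, and the `min_agreement` threshold are all assumptions made for illustration. The sampled answers would come from an LLM in practice and are passed in as plain strings here.

```python
from collections import Counter
from itertools import product

# The six levels of the revised Bloom's Taxonomy. Whether the paper uses
# exactly these labels is an assumption.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def pair_keywords_with_levels(keywords):
    """Return every (keyword, cognitive level) combination.

    Each pair would seed one instruction-generation prompt.
    """
    return list(product(keywords, BLOOM_LEVELS))


def self_consistent_answer(answers, min_agreement=0.5):
    """Majority-vote filter over several sampled answers to one instruction.

    Returns the most common answer if its share of the samples exceeds
    `min_agreement` (a hypothetical threshold), otherwise None, meaning the
    instruction would be discarded.
    """
    if not answers:
        return None
    # Naive normalization (assumption): trim whitespace and lowercase.
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top if count / len(normalized) > min_agreement else None


if __name__ == "__main__":
    pairs = pair_keywords_with_levels(["eigenvalue", "bond yield"])
    print(len(pairs), pairs[0])
    print(self_consistent_answer(["42", " 42 ", "41"]))
    print(self_consistent_answer(["a", "b", "c"]))
```

Taking the full cross product is the simplest way to spread instructions across both topics and cognitive levels. A real pipeline might instead sample a subset of pairs or compare answers with something more robust than exact string matching.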

Paper Structure

This paper contains 29 sections, 2 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Top: Traditional methods require human seed data with manual annotation. Middle: Document-based approaches require domain-specific corpora and preprocessing. Bottom: Our DS$^2$-Instruct framework generates data using only task definitions, through systematic keyword expansion and instruction generation.
  • Figure 2: Overview of the DS$^2$-Instruct framework. Given only a task definition, our approach systematically generates domain-specific instruction-tuning datasets through three stages: ❶ keyword generation using bi-directional expansion and retrieval augmentation, ❷ cognitive-level instruction generation based on Bloom's Taxonomy, and ❸ self-consistency filtering for quality assurance.
  • Figure 3: Root verb-noun pairs in instructions of (a) Domain-Aware Self-Instruct and (b) DS$^2$-Instruct in the math domain. The inner circle represents root verbs and the outer circle represents direct-object nouns. For each verb, we show the top 4 verb-noun pairs.
  • Figure 4: Impact of training data size on model performance.
  • Figure 5: Performance on models of different sizes. We evaluate the Qwen model family on the Math and GPQA datasets.
  • ...and 3 more figures