Building Domain-Specific Small Language Models via Guided Data Generation

Aman Kumar, Ekant Muljibhai Amin, Xian Yeow Lee, Lasitha Vidyaratne, Ahmed K. Farahat, Dipanjan D. Ghosh, Yuta Koreeda, Chetan Gupta

TL;DR

The paper presents DiagnosticSLM, a 3B-parameter domain-specific language model for automotive diagnostics, built through a cost-efficient three-stage pipeline: Domain-Adaptive Pretraining (DAPT) on a curated automotive corpus, Domain-specific Supervised Fine-tuning (DSFT) on a mix of task-focused and Alpaca data, and Direct Preference Optimization (DPO) for alignment. A guided data-generation strategy combines bottom-up data curation with teacher-model augmentation to produce the automotive corpora used for pretraining and supervision. Across four automotive benchmarks, DiagnosticSLM reaches up to 45.32% accuracy on MCQ tasks, outperforming larger open-source models, and shows strong results on QA, sentence completion, and summarization; ablations confirm that each training stage contributes. The work outlines a practical path to privacy-preserving, on-premise automotive AI with guardrails and deployment pipelines, and identifies retrieval-augmented inference and parameter-efficient fine-tuning as future directions.

Abstract

Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter domain-specific model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating effective domain-specific reasoning and generalization capabilities.
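
The final stage of the pipeline, DPO, can be illustrated with a minimal sketch of the standard DPO objective for a single preference pair. This is the commonly used textbook form of the loss, not necessarily the exact configuration used for DiagnosticSLM; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta (an assumed value here) scales how strongly the policy is pushed
    away from the reference.
    """
    # Implicit rewards: how much more likely each response is under the
    # policy than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy favors the chosen response more than the reference does:
    # loss drops below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is the behavior the alignment stage relies on.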

Paper Structure

This paper contains 31 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the DiagnosticSLM pipeline. The figure illustrates the key stages of our approach, including domain-specific data collection, guided synthetic data generation using a teacher model, and a three-stage training process comprising Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO).
  • Figure 2: DiagnosticSLM deployment architecture.
  • Figure 3: Shift of cosine similarity distribution from old to newly generated data.
  • Figure 4: JSON representation of an example from our evaluation datasets.