Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications

Chenhua Shi, Gregor Macdonald, Bhavika Jalli, Wanlu Lei, John Zou, Mridul Jain, Joji Philip

TL;DR

This work tackles the high-cost data requirement for fine-tuning LLMs in specialized telecom domains by introducing a fully automated, multi-stage pipeline that grounds synthetic QA data in a domain knowledge graph. It combines a HippoRAG-based retriever, a base generator, and a refinement model, followed by customized RAGAS metrics to filter high-quality samples for reinforcement fine-tuning. The approach yields diverse, context-rich, and procedurally correct troubleshooting data, demonstrated on RAN/Telco troubleshooting tasks and the TeleQuAD dataset, with the hybrid generation strategy outperforming base-only and instruct-tuned-only baselines. The method offers a scalable path to domain-specific LLM adaptation, significantly reducing reliance on expert labeling while preserving technical fidelity and practical utility.

Abstract

The success of large language models (LLMs) depends heavily on large-scale, high-quality instruction-following and reinforcement datasets. However, generating such data through human annotation is prohibitively time-consuming, particularly for domain-specific tasks like telecom network troubleshooting, where accurate responses require deep technical expertise and contextual understanding. In this paper, we present a fully automated, retrieval-augmented pipeline for generating synthetic question-answer (QA) pairs grounded in structured domain knowledge. Our multi-stage framework integrates a retriever, base generator, and refinement model to synthesize and enhance QA pairs using documents retrieved from a domain-specific knowledge graph. To ensure data quality, we employ customized RAGAS-based scoring to filter low-quality samples, producing a high-quality dataset suitable for reinforcement fine-tuning (RFT). We demonstrate our approach in a real-world telecom scenario focused on radio access network (RAN) troubleshooting. The resulting pipeline generates complex, context-rich troubleshooting solution plans without human intervention. This work offers a scalable solution for building instruction and reinforcement datasets in specialized domains, significantly reducing dependence on manual labeling while maintaining high technical fidelity.
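
The control flow of the multi-stage framework described in the abstract (retrieve, generate, refine, then score and filter) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `QAPair` and `build_dataset`, the callable signatures, and the 0.7 threshold are all hypothetical, and the four stages are passed in as plain functions so that no particular retriever, model, or RAGAS API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    """One synthetic sample (hypothetical structure, not from the paper)."""
    question: str
    answer: str
    context: List[str]
    score: float = 0.0


def build_dataset(
    seeds: Iterable[str],
    retrieve: Callable[[str], List[str]],         # stand-in for the knowledge-graph retriever
    generate: Callable[[str, List[str]], QAPair],  # stand-in for the base generator
    refine: Callable[[QAPair], QAPair],            # stand-in for the refinement model
    score: Callable[[QAPair], float],              # stand-in for RAGAS-style scoring
    threshold: float = 0.7,                        # assumed cutoff, not a value from the paper
) -> List[QAPair]:
    """Run each seed through all stages and keep samples scoring at or above threshold."""
    kept: List[QAPair] = []
    for seed in seeds:
        context = retrieve(seed)         # stage 1: fetch grounding documents
        draft = generate(seed, context)  # stage 2: draft a QA pair from the context
        refined = refine(draft)          # stage 3: improve the draft
        refined.score = score(refined)   # stage 4: quality score
        if refined.score >= threshold:   # drop low-quality samples
            kept.append(refined)
    return kept
```

Passing the stages as callables keeps the sketch independent of any model or scoring library; in practice each one would wrap the corresponding component.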

Paper Structure

This paper contains 18 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The architecture diagram for Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning LLMs.
  • Figure 2: Distribution of RAGAS metrics between Seed and Synthetic Dataset.
  • Figure 3: Example question–answer pair generated from the synthetic dataset.
  • Figure 4: Distribution of Pairwise Similarities between Seed and Synthetic Dataset.
  • Figure 5: Distribution of Pairwise Similarities between Refiner Qwen3-32B and Qwen3-8B in the Synthetic Dataset.
  • ...and 1 more figure