Hybrid Training Approaches for LLMs: Leveraging Real and Synthetic Data to Enhance Model Performance in Domain-Specific Applications

Alexey Zhezherau, Alexei Yanockin

TL;DR

The paper investigates how to fill gaps in real data for domain-specific LLM fine-tuning by combining real counseling data with high-quality synthetic data. It constructs a 500-session dataset (300+ real sessions plus synthetic sessions), generates diverse synthetic personas and CBT-centric scenarios, and evaluates three model variants (base, real-data fine-tuned, hybrid fine-tuned) on empathy and relevance. Empirical results show the hybrid model achieving higher average scores with lower variability, which indicates better robustness and contextual sensitivity in therapy-like interactions. The work demonstrates that synthetic data can complement scarce real data to improve performance in sensitive domains, with implications for scalable, context-aware AI applications in mental health support.

Abstract

This research explores a hybrid approach to fine-tuning large language models (LLMs) by integrating real-world and synthetic data to boost model performance, particularly in generating accurate and contextually relevant responses. By leveraging a dataset combining transcribed real interactions with high-quality synthetic sessions, we aimed to overcome the limitations of scarce, noisy, and domain-specific real data. Synthetic personas and scenarios were employed to enhance training diversity. The study evaluated three models: a base foundational model, a model fine-tuned with real data, and a hybrid fine-tuned model. Experimental results showed that the hybrid model consistently outperformed the others in specific vertical applications, achieving the highest scores across all metrics. Further testing confirmed the hybrid model's superior adaptability and contextual understanding across diverse scenarios. These findings suggest that combining real and synthetic data can significantly improve the robustness and contextual sensitivity of LLMs, particularly in domain-specific and vertical use cases.

Paper Structure

This paper contains 21 sections and 7 figures.

Figures (7)

  • Figure 1: Flowchart showing the steps involved in generating and refining synthetic therapy sessions.
  • Figure 2: Model Performance Summary
  • Figure 3: Distribution of Scores by Model
  • Figure 4: Distribution of Scores by Model, Violin chart
  • Figure 5: Empathy vs Relevance Scores
  • ...and 2 more figures