Synth-Empathy: Towards High-Quality Synthetic Empathy Data

Hao Liang, Linzhuang Sun, Jingxuan Wei, Xijie Huang, Linkun Sun, Bihui Yu, Conghui He, Wentao Zhang

TL;DR

Synth-Empathy tackles data scarcity and labeling costs for empathetic LLMs with a three-stage data-generation and curation pipeline. First, it generates synthetic empathetic data from prompts anchored to the EmpatheticDialogues (ED) corpus. Second, it applies a two-step quality filter: an empathetic discriminator followed by a similarity-based selector that compares the cosine similarity $S = \frac{E_D \cdot E_G}{\|E_D\| \|E_G\|}$ against a threshold $T$. Third, it enforces diversity via $K$-Center Greedy selection. LLMs fine-tuned on the curated data achieve state-of-the-art performance on multiple automatic empathy benchmarks and in human evaluations, and the experiments reveal a trade-off between data quantity and quality. The approach requires no human labeling, improves the coherence, naturalness, and empathy of the data, and delivers competitive or superior results against strong baselines. Overall, Synth-Empathy offers a data-centric way to scale the empathetic capabilities of LLMs and provides guidance on data generation, filtering, and diversification.
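The similarity selector and the diversity step summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of NumPy, the Euclidean distance in the greedy step, and the choice to keep samples with $S \geq T$ (the summary does not say which side of the threshold is kept) are all hypothetical.

```python
import numpy as np


def cosine_similarity(e_d, e_g):
    """S = (E_D . E_G) / (||E_D|| ||E_G||) for two embedding vectors."""
    e_d = np.asarray(e_d, dtype=float)
    e_g = np.asarray(e_g, dtype=float)
    return float(e_d @ e_g / (np.linalg.norm(e_d) * np.linalg.norm(e_g)))


def similarity_filter(ref_embs, gen_embs, threshold):
    """Return indices of pairs whose similarity meets the threshold T.

    Assumption: samples with S >= T are kept; the direction of the
    comparison is not specified in the summary above.
    """
    return [
        i
        for i, (e_d, e_g) in enumerate(zip(ref_embs, gen_embs))
        if cosine_similarity(e_d, e_g) >= threshold
    ]


def k_center_greedy(points, k, first=0):
    """Greedy k-center selection: repeatedly pick the point farthest
    from the already selected set (Euclidean distance, an assumption)."""
    points = np.asarray(points, dtype=float)
    k = min(k, len(points))
    if k <= 0:
        return []
    selected = [first]
    # Distance from every point to its nearest selected center.
    min_dist = np.linalg.norm(points - points[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected
```

In this sketch, the quality filter would run first and `k_center_greedy` would then be applied to the embeddings of the surviving samples, so that the final subset spreads across the embedding space rather than clustering around near-duplicates.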

Abstract

In recent years, with the rapid advancement of large language models (LLMs), strong empathetic response capabilities have become a crucial prerequisite. Consequently, managing and understanding empathetic datasets has gained increasing significance. However, empathetic data are typically human-labeled, which limits dataset size and consumes considerable human labor. In this work, we present Synth-Empathy, an LLM-based pipeline for data generation and for quality and diversity selection that automatically generates high-quality empathetic data while discarding low-quality data. Using data generated by a model with low empathetic ability, we are able to further improve empathetic response performance and achieve state-of-the-art (SoTA) results across multiple benchmarks. Moreover, our model achieves SoTA performance on various human evaluation benchmarks, demonstrating its effectiveness and robustness in real-world applications. Furthermore, we show the trade-off between data quantity and quality, providing insights into empathetic data generation and selection.

Paper Structure (27 sections, 3 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Comparison of our Synth-Empathy data-trained model with previous SoTA models. The results demonstrate that our model achieves superior performance on multiple empathetic benchmarks.
  • Figure 2: Comparison of Data Examples. (a) An example from the ED dataset. (b) An example from the synthetic dataset.
  • Figure 3: Empathetic Data Generation and Curation Pipeline, which is composed of (1) an Empathetic Data Generation module, (2) a Quality Data Selection module, (3) a Diversity Data Selection module, and (4) an Empathetic Model Training module.
  • Figure 4: Data Quality Evaluation Prompts. (a) Assessing the coherence of the data. (b) Assessing the naturalness of the data. (c) Assessing the empathy of the data.
  • Figure 5: Scores of Coherence, Naturalness, and Empathy for Generated Data. (a) Scores before applying the data filtering strategy. (b) Scores after applying the data filtering strategy.
  • ...and 3 more figures