Synth-Empathy: Towards High-Quality Synthetic Empathy Data

Hao Liang; Linzhuang Sun; Jingxuan Wei; Xijie Huang; Linkun Sun; Bihui Yu; Conghui He; Wentao Zhang

TL;DR

Synth-Empathy addresses data scarcity and labeling cost for empathetic LLMs with a three-stage data generation and curation pipeline. First, it generates synthetic empathetic data from prompts anchored to the EmpatheticDialogues (ED) corpus. Second, it applies a two-step quality filter: an empathetic discriminator, followed by a similarity-based selector that computes the cosine similarity $S = \frac{E_D \cdot E_G}{\|E_D\| \|E_G\|}$ and compares it against a threshold $T$. Third, it enforces diversity via $K$-Center Greedy selection. LLMs fine-tuned on the curated data achieve state-of-the-art performance on multiple automatic empathy benchmarks and in human evaluations, and the experiments reveal a trade-off between data quantity and quality. The approach requires no human labeling, improves the coherence, naturalness, and empathy of the data, and delivers competitive or superior results against strong baselines. Overall, Synth-Empathy offers a data-centric way to scale the empathetic capabilities of LLMs and provides actionable guidance on data generation, filtering, and diversification for real-world deployment.
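The two selection steps named above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions because this summary does not state them: that $E_D$ and $E_G$ are paired embedding vectors (one row per sample), that samples with $S \geq T$ are the ones kept, and that $K$-Center Greedy uses Euclidean distance with an arbitrary first center. The function names `cosine_similarity`, `similarity_filter`, and `k_center_greedy` are hypothetical.

```python
import numpy as np


def cosine_similarity(e_d, e_g):
    """Row-wise S = (E_D . E_G) / (||E_D|| * ||E_G||)."""
    e_d = np.asarray(e_d, dtype=float)
    e_g = np.asarray(e_g, dtype=float)
    num = np.sum(e_d * e_g, axis=-1)
    den = np.linalg.norm(e_d, axis=-1) * np.linalg.norm(e_g, axis=-1)
    return num / den


def similarity_filter(e_d, e_g, threshold):
    """Indices of pairs whose similarity reaches the threshold T.

    Assumption: keeping S >= T; the summary does not state the direction.
    """
    s = cosine_similarity(e_d, e_g)
    return np.flatnonzero(s >= threshold)


def k_center_greedy(embeddings, k, first=0):
    """Greedy k-center selection: repeatedly add the point that is
    farthest from the centers chosen so far.

    Assumptions: Euclidean distance, and `first` as the initial center.
    """
    x = np.asarray(embeddings, dtype=float)
    n = x.shape[0]
    k = min(k, n)
    if k <= 0:
        return []
    selected = [first]
    # Distance from every point to its nearest selected center.
    min_dist = np.linalg.norm(x - x[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(x - x[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected
```

In a pipeline of this shape, one would typically run `similarity_filter` first and then pass the embeddings of the surviving samples to `k_center_greedy`, so that diversity is enforced only among samples that already passed the quality check.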

Abstract

With the rapid advancement of large language models (LLMs) in recent years, strong empathetic response capability has become a crucial requirement. Consequently, managing and understanding empathetic datasets has gained increasing significance. However, empathetic data are typically human-labeled, which leads to insufficient datasets and wasted human labor. In this work, we present Synth-Empathy, an LLM-based pipeline for data generation and for quality and diversity selection that automatically generates high-quality empathetic data while discarding low-quality data. Using data generated by a model with low empathetic ability, we further improve empathetic response performance and achieve state-of-the-art (SoTA) results across multiple benchmarks. Moreover, our model achieves SoTA performance on various human evaluation benchmarks, demonstrating its effectiveness and robustness in real-world applications. Furthermore, we show the trade-off between data quantity and quality, providing insights into empathetic data generation and selection.

Paper Structure (27 sections, 3 equations, 8 figures, 4 tables)

Introduction
Related Work
Empathetic Response Generation
Data Quality and Data Selection
Data Quality
Data Selection
Data Generation
Method
Empathetic Data Generation
Empathetic Data Quality Selection
Empathetic Discriminator
Similarity Based Quality Selection
Empathetic Data Diversity Selection
High Quality Generated Empathetic Data
Experiments
...and 12 more sections

Figures (8)

Figure 1: Comparison of our Synth-Empathy data-trained model with previous SoTA models. The results demonstrate that our model achieves superior performance on multiple empathetic benchmarks.
Figure 2: Comparison of Data Examples. (a) An example from the ED dataset. (b) An example from the synthetic dataset.
Figure 3: Empathetic Data Generation and Curation Pipeline, composed of (1) an Empathetic Data Generation module, (2) a Quality Data Selection module, (3) a Diversity Data Selection module, and (4) an Empathetic Model Training module.
Figure 4: Data Quality Evaluation Prompts. (a) Assessing the coherence of the data. (b) Assessing the naturalness of the data. (c) Assessing the empathy of the data.
Figure 5: Scores of Coherence, Naturalness, and Empathy for Generated Data. (a) Scores before applying the data filtering strategy. (b) Scores after applying the data filtering strategy.
...and 3 more figures
