Can Large Language Models be Good Emotional Supporter? Mitigating Preference Bias on Emotional Support Conversation

Dongjin Kang; Sunghwan Kim; Taeyoon Kwon; Seungjun Moon; Hyunsouk Cho; Youngjae Yu; Dongha Lee; Jinyoung Yeo

TL;DR

This paper examines why large language models struggle to provide quality emotional support in Emotional Support Conversation (ESC) tasks, focusing on strategy selection. It introduces a strategy-centric framework with metrics for proficiency, preference, and preference bias, and shows that models' strong preferences for particular strategies reduce their robustness across ESC stages. The authors test both self-contact and external-contact mitigation approaches and find that external assistance, particularly strategy planners and external knowledge, is consistent with the Contact Hypothesis: it improves proficiency, reduces bias, and raises human-evaluated support quality. The work offers actionable directions for designing emotionally intelligent LLMs, including externally assisted strategy planning and careful prompt design, and discusses limitations and future directions for safe, effective ESC deployments.

Abstract

Emotional Support Conversation (ESC) is a task aimed at alleviating individuals' emotional distress through daily conversation. Given its inherent complexity and non-intuitive nature, the ESConv dataset incorporates support strategies to facilitate the generation of appropriate responses. Despite the remarkable conversational ability of large language models (LLMs), previous studies have suggested that they often struggle to provide useful emotional support. Hence, this work first analyzes the results of LLMs on ESConv, revealing challenges in selecting the correct strategy and a notable preference for a specific strategy. Motivated by these findings, we explore the impact of the inherent preference in LLMs on providing emotional support, and we observe that exhibiting high preference for specific strategies hinders effective emotional support, degrading robustness in predicting the appropriate strategy. Moreover, we conduct a methodological study to offer insights into the approaches LLMs need in order to serve as proficient emotional supporters. Our findings emphasize that (1) low preference for specific strategies hinders the progress of emotional support, (2) external assistance helps reduce preference bias, and (3) existing LLMs alone cannot become good emotional supporters. These insights suggest promising avenues for future research to enhance the emotional intelligence of LLMs.

Paper Structure (80 sections, 11 equations, 16 figures, 19 tables)

Introduction
Preliminaries & Related Work
Emotional Support Conversation
Incorporating Strategies into ESC Systems
Emotional Support from LLMs
Evaluation Setup
Task and Focus
Task: emotional support response generation.
Focus: strategy-centric analysis.
Evaluation Set
Metrics
Proficiency.
Preference.
Preference Bias.
Proficiency and Preference of LLMs on Strategy
...and 65 more sections

Figures (16)

Figure 1: An example of an emotional support conversation with the analysis on the results of LLMs. LLMs tend to excessively prefer one or two specific strategies. Details about experiments are in Appendix \ref{app:motivation_llms}.
Figure 2: The results of strategy-constrained responses on both automated and human evaluation, showing the efficacy of strategy on ChatGPT. Appropriate strategy significantly enhances the quality of emotional support responses. The details are in Appendix \ref{app:importance_of_strategy}.
Figure 3: The details of LLMs' proficiency and preference. (a) The weighted F1 score on each test set $D_t$; the red dashed line indicates the proficiency $\mathcal{Q}$ on the entire test set $D$. (b) The preference ($p_i$) for each strategy, together with the average preference of the strategies belonging to each stage and, below each LLM, the preference bias $\mathcal{B}$; the gray dashed line ($p_i = 1$) marks the threshold between preferring and not preferring the respective strategy.
Figure 4: The results of iterations on Direct-Refine and Self-Refine in ChatGPT. To mitigate preference bias, strategies with $p_i>1$ should lean towards the negative direction, while strategies with $p_i<1$ should lean towards the positive direction as the iteration progresses.
Figure 5: The weighted-F1 scores for each test set ($D_t$) and the macro-F1 score $\mathcal{Q}$ for the entire test set ($D$) on ChatGPT and LLaMA2. Self-contact and external-contact methods are shaded gray and yellow, respectively.
...and 11 more figures
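
The captions for Figures 3 and 5 refer to weighted-F1 scores on each test set $D_t$ and to a macro-F1 score $\mathcal{Q}$ on the entire test set $D$. As a rough illustration of how these two averages differ, the sketch below computes both over predicted strategy labels using the standard textbook definitions. It is not the authors' evaluation code: the function names and the placeholder labels `strategy_a` and `strategy_b` are hypothetical, chosen only for this example.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} for every label that occurs in gold or pred."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never matches
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1: every strategy counts equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold, pred):
    """Mean of per-class F1 weighted by each label's frequency in gold."""
    scores = per_class_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in scores) / len(gold)


# Hypothetical toy data: a model that always predicts the same strategy.
gold = ["strategy_a", "strategy_a", "strategy_a", "strategy_b"]
pred = ["strategy_a", "strategy_a", "strategy_a", "strategy_a"]

print(f"macro-F1:    {macro_f1(gold, pred):.4f}")
print(f"weighted-F1: {weighted_f1(gold, pred):.4f}")
```

In this toy case the model that always predicts one strategy scores lower on macro-F1 than on weighted-F1, because macro averaging gives the rarely used strategy the same weight as the frequent one.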
