The Colorful Future of LLMs: Evaluating and Improving LLMs as Emotional Supporters for Queer Youth

Shir Lissak; Nitay Calderon; Geva Shenkman; Yaakov Ophir; Eyal Fruchter; Anat Brunstein Klomek; Roi Reichart

TL;DR

This work investigates whether Large Language Models can serve as empathetic emotional supporters for LGBTQ+ youth, who face elevated mental health risks and barriers to traditional help. It develops a novel ten-question evaluation scale and builds the LGBTeen dataset from Reddit to benchmark eight SOTA LLMs against human responses, revealing that while LLMs can be supportive, they are often generic and lack personalization and reliability. A targeted 'Guided Supporter' prompt partially mitigates these issues, and the authors propose a four-component blueprint for a reliable, empathetic, and personalized AI queer supporter, including alignment, identification, and assertion modules plus a queer-dedicated data collection. The paper discusses ethical considerations and limitations, highlighting that AI-based queer youth support is promising as an initial resource but not a substitute for professional help, and it provides a foundation for future research and dataset sharing.

Abstract

Queer youth face increased mental health risks, such as depression, anxiety, and suicidal ideation. Hindered by negative stigma, they often avoid seeking help and rely on online resources, which may provide incompatible information. Although access to a supportive environment and reliable information is invaluable, many queer youth worldwide have no access to such support. However, this could soon change due to the rapid adoption of Large Language Models (LLMs) such as ChatGPT. This paper aims to comprehensively explore the potential of LLMs to revolutionize emotional support for queers. To this end, we conduct a qualitative and quantitative analysis of LLMs' interactions with queer-related content. To evaluate response quality, we develop a novel ten-question scale that is inspired by psychological standards and expert input. We apply this scale to score several LLMs and human comments to posts where queer youth seek advice and share experiences. We find that LLM responses are supportive and inclusive, outscoring humans. However, they tend to be generic, not empathetic enough, and lack personalization, resulting in unreliable and potentially harmful advice. We discuss these challenges, demonstrate that a dedicated prompt can improve performance, and propose a blueprint of an LLM supporter that actively (but sensitively) seeks user context to provide personalized, empathetic, and reliable responses. Our annotated dataset is available for further research.

Paper Structure (26 sections, 15 figures, 5 tables)

Introduction
Background
The Promise of AI Queer Supporters
Analysis of the Current State
Qualitative Analysis
Quantitative Analysis
The LGBTeen Dataset
Results
The Future of LLM Queer Supporters: Reliability, Empathy, Personalization
Discussion
Limitations
Assessment Questionnaire
Low IAA in Subjective Tasks
Applicability to Other Populations
Other Non-English Languages
...and 11 more sections

Figures (15)

Figure 1: Comparison between the diversity of Reddit posts, human comments, and LLM responses (green solid lines, the thickest line is the mean trend). The average cosine similarity of the embeddings (Y-axis) is computed over the K most similar instances (X-axis) as follows: For each instance, we first find the K instances with the highest score and compute the mean score with the instance. Then, we average all these means. $\downarrow$ is better, indicating higher diversity.
Figure 2: Our proposed blueprint of an AI queer supporter consists of four core components: an aligned LLM, a queer-dedicated textual collection, an Identification component, and an Assertion component. The queer-dedicated collection is used for aligning the LLM and training the Identification and Assertion components. The collection should include reliable information and conversation examples that reflect safe, supportive, inclusive, and authentic interactions between queer youth and emotional supporters, and must also cover multiple personas with different socio-cultural traits. Notably, the Identification and Assertion components are external to the LLM and may become redundant if the LLM achieves satisfactory alignment. Overall, the ecosystem should support the following four functions: (1) Identification of queer-related information and support-seeking intent; (2) User characterization, including sensitive extraction of additional personal information and context (e.g., by guiding the LLM's question-generation process); (3) Personalization (e.g., by retrieving related content and adjusting the LLM prompt); and (4) Assertion that the generated responses are empathetic, safe, and reliable. See Appendix §\ref{sec:vision_appendix} for full details.
Figure 3: Comparison between the diversity of Reddit posts, human comments and LLM responses (green solid lines, the thickest line is the mean trend). Average BLEU scores (Y-axis) are computed over the K most similar instances (X-axis) as follows: For each instance, we first find the K instances with the highest score and compute the mean score with the instance. Then, we average all these means. $\downarrow$ is better (higher diversity).
Figure 4: t-SNE visualization of the embeddings of 300 randomly sampled Reddit posts and their ChatGPT responses. As can be seen, ChatGPT responses fall into three main clusters, while Reddit posts are more spread out. This supports our argument that the responses are generic and "templated".
Figure 5: A glimpse of our evaluation platform, built with the Label Studio software. The right side displays a post and two general information questions (queer identity and age). On the top left, we show another post paired with a response (most upvoted Reddit comment) that the evaluators annotate according to the ten-question questionnaire. Notice that we also provide the evaluator with a place to write comments. A useful feature is demonstrated in the bottom right: hovering the mouse over a response option (e.g., "Partially" of the LGBTQ+ Inclusiveness question) triggers a pop-up detailing the specific criteria for that selection.
...and 10 more figures
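
The diversity measure described in the captions of Figures 1 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `mean_topk_similarity` are hypothetical, each instance is assumed to be excluded from its own neighbor set, and the toy vectors stand in for real text embeddings. For the Figure 3 variant, the cosine score would be swapped for a pairwise BLEU score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_topk_similarity(vectors, k):
    """Average over all instances of the mean similarity to the K most similar others.

    Lower values indicate a more diverse collection.
    Assumption: an instance is never counted as its own neighbor.
    """
    n = len(vectors)
    if not 1 <= k <= n - 1:
        raise ValueError("k must be between 1 and len(vectors) - 1")
    per_instance = []
    for i, u in enumerate(vectors):
        # Similarity of instance i to every other instance, highest first.
        sims = sorted(
            (cosine(u, v) for j, v in enumerate(vectors) if j != i),
            reverse=True,
        )
        # Mean score between instance i and its K most similar instances.
        per_instance.append(sum(sims[:k]) / k)
    # Average of the per-instance means.
    return sum(per_instance) / n


# Toy embeddings (hypothetical): two identical vectors and one orthogonal vector.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score_k1 = mean_topk_similarity(toy, k=1)
score_k2 = mean_topk_similarity(toy, k=2)
```

Sweeping `k` over a range and plotting the resulting scores would give a curve of the kind the captions describe, with one curve per collection (posts, human comments, or a given LLM's responses).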
