The Colorful Future of LLMs: Evaluating and Improving LLMs as Emotional Supporters for Queer Youth
Shir Lissak, Nitay Calderon, Geva Shenkman, Yaakov Ophir, Eyal Fruchter, Anat Brunstein Klomek, Roi Reichart
TL;DR
This work investigates whether Large Language Models can serve as empathetic emotional supporters for LGBTQ+ youth, who face elevated mental health risks and barriers to traditional help. It develops a novel ten-question evaluation scale and builds the LGBTeen dataset from Reddit to benchmark eight state-of-the-art LLMs against human responses. The results show that LLM responses can be supportive, but they are often generic and lack personalization and reliability. A targeted 'Guided Supporter' prompt partially mitigates these issues. The authors also propose a four-component blueprint for a reliable, empathetic, and personalized AI queer supporter: alignment, identification, and assertion modules, plus a queer-dedicated data collection. The paper discusses ethical considerations and limitations, stressing that AI-based support for queer youth is promising as an initial resource but is not a substitute for professional help, and it provides a foundation for future research and dataset sharing.
Abstract
Queer youth face increased mental health risks, such as depression, anxiety, and suicidal ideation. Hindered by negative stigma, they often avoid seeking help and rely on online resources, which may provide incompatible information. Although access to a supportive environment and reliable information is invaluable, many queer youth worldwide have no access to such support. However, this could soon change due to the rapid adoption of Large Language Models (LLMs) such as ChatGPT. This paper aims to comprehensively explore the potential of LLMs to revolutionize emotional support for queers. To this end, we conduct a qualitative and quantitative analysis of LLMs' interactions with queer-related content. To evaluate response quality, we develop a novel ten-question scale that is inspired by psychological standards and expert input. We apply this scale to score several LLMs and human comments on posts where queer youth seek advice and share experiences. We find that LLM responses are supportive and inclusive, outscoring humans. However, they tend to be generic, not empathetic enough, and lack personalization, resulting in unreliable and potentially harmful advice. We discuss these challenges, demonstrate that a dedicated prompt can improve performance, and propose a blueprint of an LLM-supporter that actively (but sensitively) seeks user context to provide personalized, empathetic, and reliable responses. Our annotated dataset is available for further research.
