Unlocking LLMs: Addressing Scarce Data and Bias Challenges in Mental Health
Vivek Kumar, Eirini Ntoutsi, Pushpraj Singh Rajawat, Giacomo Medda, Diego Reforgiato Recupero
TL;DR
This work tackles data scarcity and bias in applying LLMs to mental health by building IC-AnnoMI, an expert-annotated MI dialogue dataset generated with targeted in-context prompting of ChatGPT and labeled under the Motivational Interviewing Skill Code (MISC). The authors evaluate multiple classical ML and transformer baselines on utterance-level quality classification, showing that augmented data improves balanced accuracy, with DistilBERT achieving the highest balanced accuracy, and that expert evaluation along the psychological and linguistic dimensions ($MI_{psych}$ and $MI_{ling}$) supports the quality of the generated dialogues. They demonstrate that progressive prompting can yield in-context MI dialogues that approximate $MI_{org.}$, while highlighting risks such as hallucinations and the need for human-in-the-loop supervision. The dataset and code are public, providing a resource for the MI community and guiding responsible deployment of LLMs for empathetic therapy in low-resource, sensitive domains.
Abstract
Large language models (LLMs) have shown promising capabilities in healthcare analysis but face several challenges, such as hallucinations, parroting, and bias manifestation. These challenges are exacerbated in complex, sensitive, and low-resource domains. In this work, we therefore introduce IC-AnnoMI, an expert-annotated motivational interviewing (MI) dataset built upon AnnoMI by generating in-context conversational dialogues with LLMs, particularly ChatGPT. IC-AnnoMI employs targeted prompts accurately engineered through cues and tailored information, taking into account therapy style (empathy, reflection), contextual relevance, and false semantic change. Subsequently, the dialogues are annotated by experts, strictly adhering to the Motivational Interviewing Skill Code (MISC), focusing on both the psychological and linguistic dimensions of MI dialogues. We comprehensively evaluate the IC-AnnoMI dataset and ChatGPT's emotional reasoning ability and understanding of domain intricacies by modeling novel classification tasks with several classical machine learning and current state-of-the-art transformer approaches. Finally, we discuss the effects of progressive prompting strategies and the impact of augmented data in mitigating the biases manifested in IC-AnnoMI. Our contributions provide the MI community with not only a comprehensive dataset but also valuable insights for using LLMs in empathetic text generation for conversational therapy in supervised settings.
