Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Yuqing Wang, Yun Zhao, Sara Alessandra Keller, Anne de Hond, Marieke M. van Buchem, Malvika Pillai, Tina Hernandez-Boussard

TL;DR

This work systematically evaluates bias in large language models for mental health analysis by enriching prompts with demographic context across eight diverse datasets and ten models. It introduces a bias evaluation pipeline and four fairness-aware prompting strategies, showing that larger models like GPT-4 can achieve strong performance with fairer behavior when guided appropriately, though domain-specific models such as MentalRoBERTa often outperform LLMs on accuracy and fairness. Few-shot Chain-of-Thought prompting generally improves both performance and fairness, and fairness-aware prompts consistently reduce bias with limited performance loss. The findings suggest that model size, domain-specific adaptation, and targeted prompting jointly influence fairness in mental health applications, with practical implications for safer deployment and future research directions in high-stakes NLP tasks.

Abstract

Large language models (LLMs) have demonstrated strong capabilities across various applications, including mental health analysis. However, existing studies have focused on predictive performance and left the critical issue of fairness underexplored, which poses significant risks to vulnerable populations. Although previous works acknowledge potential biases, they have not thoroughly investigated these biases or their impacts. To address this gap, we systematically evaluate biases across seven social factors (e.g., gender, age, religion) using ten LLMs with different prompting methods on eight diverse mental health datasets. Our results show that GPT-4 achieves the best overall balance of performance and fairness among LLMs, although it still lags behind domain-specific models like MentalRoBERTa in some cases. Additionally, our tailored fairness-aware prompts effectively mitigate bias in mental health predictions, highlighting the potential for fair analysis in this field.

Paper Structure (40 sections, 3 figures, 5 tables)

This paper contains 40 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The pipeline for evaluating and mitigating bias in LLMs for mental health analysis. User queries undergo demographic enrichment to identify biases. LLM responses are evaluated for performance and fairness. Bias mitigation is applied through fairness-aware prompts to achieve clinically accepted EO scores.
  • Figure 2: Average F1 and EO scores across datasets, ordered by model size (indicated in parentheses). BERT-based models demonstrate superior performance and fairness. For LLMs, as model size increases, performance generally improves (higher F1 scores), and fairness improves (lower EO scores).
  • Figure 3: Average F1 and EO scores for all demographic factors on four models. For each model, the results are averaged over all datasets. Note that Llama3-8B and GPT-4 are based on zero-shot scenarios.
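
The pipeline in Figure 1 (demographic enrichment, then scoring for performance and fairness) can be illustrated with a minimal Python sketch. It rests on several assumptions that this summary does not confirm: EO is treated as an equalized-odds gap, taken here as the larger of the true-positive-rate and false-positive-rate differences across groups, which is consistent with lower EO meaning fairer behavior in Figure 2. The function names `enrich_prompt`, `f1_score`, and `equalized_odds_gap` and the prompt template are hypothetical and are not the authors' implementation or wording.

```python
from collections import defaultdict


def enrich_prompt(post: str, demographic: str) -> str:
    """Attach demographic context to a user post (illustrative template only)."""
    return f"Given that the poster is {demographic}: {post}"


def confusion(y_true, y_pred):
    """Count (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(y_true, y_pred) -> float:
    """Binary F1; returns 0.0 when there are no positives at all."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def equalized_odds_gap(y_true, y_pred, groups) -> float:
    """Assumed EO: max of the TPR gap and FPR gap across demographic groups."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)

    tprs, fprs = [], []
    for truths, preds in by_group.values():
        tp, fp, fn, tn = confusion(truths, preds)
        tprs.append(tp / (tp + fn) if tp + fn else 0.0)
        fprs.append(fp / (fp + tn) if fp + tn else 0.0)
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Toy labels for two hypothetical demographic groups, "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(enrich_prompt("I feel low lately.", "a 30-year-old woman"))
print(f1_score(y_true, y_pred))
print(equalized_odds_gap(y_true, y_pred, groups))
```

In this toy data the model is perfect on group "a" and wrong on half of each class in group "b", so the gap is nonzero even though the overall F1 looks reasonable. That is the kind of disparity a fairness metric is meant to surface alongside accuracy.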