
A Comprehensive Evaluation of Large Language Models on Mental Illnesses in Arabic Context

Noureldin Zahran, Aya E. Fouda, Radwa J. Hanafy, Mohammed E. Fouda

TL;DR

This work evaluates eight diverse LLMs on Arabic mental health tasks across native and translated datasets to understand cross-lingual performance and prompting effects. It demonstrates that prompt design, especially structured prompts, strongly influences diagnostic accuracy, with multi-class tasks being particularly sensitive. Phi-3.5 MoE and Mistral NeMo emerge as top performers for balanced accuracy and mean absolute error, respectively, while few-shot prompting yields substantial gains, notably for GPT-4o Mini. The findings highlight translation biases, dataset difficulty, and the value of targeted prompting strategies for culturally aware, Arabic-speaking mental health diagnostics, while outlining avenues for improved data quality and model adaptation.

Abstract

Mental health disorders pose a growing public health concern in the Arab world, emphasizing the need for accessible diagnostic and intervention tools. Large language models (LLMs) offer a promising approach, but their application in Arabic contexts faces challenges including limited labeled datasets, linguistic complexity, and translation biases. This study comprehensively evaluates 8 LLMs, including general multilingual models as well as bilingual ones, on diverse mental health datasets (such as AraDepSu, Dreaddit, and MedMCQA), investigating the impact of prompt design, language configuration (native Arabic vs. translated English, and vice versa), and few-shot prompting on diagnostic performance. We find that prompt engineering significantly influences LLM scores, mainly through its effect on instruction following: our structured prompt outperformed a less structured variant on multi-class datasets, with an average difference of 14.5%. While the influence of language on performance was modest, model selection proved crucial: Phi-3.5 MoE excelled in balanced accuracy, particularly for binary classification, while Mistral NeMo showed superior performance in mean absolute error for severity prediction tasks. Few-shot prompting consistently improved performance, with particularly substantial gains observed for GPT-4o Mini on multi-class classification, boosting accuracy by an average factor of 1.58. These findings underscore the importance of prompt optimization, multilingual analysis, and few-shot learning for developing culturally sensitive and effective LLM-based mental health tools for Arabic-speaking populations.
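
The abstract reports results in terms of balanced accuracy (for classification) and mean absolute error (for severity prediction). As a reminder of what these two metrics measure, here is a minimal sketch using their standard definitions. The function names, labels, and toy values are illustrative assumptions and are not taken from the paper or its evaluation code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    total = defaultdict(int)    # number of examples per true class
    correct = defaultdict(int)  # correctly predicted examples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted numeric severity scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data (not from any dataset in the paper).
labels = ["depressed", "depressed", "depressed", "control"]
preds = ["depressed", "depressed", "control", "control"]
ba = balanced_accuracy(labels, preds)  # recall 2/3 and 1/1, averaged

severity_true = [0, 1, 2, 3]
severity_pred = [0, 2, 2, 1]
mae = mean_absolute_error(severity_true, severity_pred)  # (0 + 1 + 0 + 2) / 4
```

Balanced accuracy is a natural choice when classes are imbalanced, since a model that always predicts the majority class scores only 1/k on a k-class task. Both functions assume non-empty inputs.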

Paper Structure

This paper contains 40 sections, 5 equations, 15 figures, and 12 tables.

Figures (15)

  • Figure 1: Example of a formatted prompt (ZS-2) for binary depression classification.
  • Figure 2: Example of a formatted prompt (ZS-2) for a binary classification task on depression.
  • Figure 3: Percentage of invalid responses for each dataset-model trial using the ZS-1 Prompt. Columns with a maximum invalid response percentage below 5% were omitted for clarity.
  • Figure 4: Percentage of invalid responses for each dataset-model trial using the ZS-2 Prompt. Columns with a maximum invalid response percentage below 5% were omitted for clarity.
  • Figure 5: Comparison of performance of models on BA for prompts ZS-1 and ZS-2 across all datasets, measuring relative improvement percentage.
  • ...and 10 more figures
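
Figures 3 to 5 refer to the percentage of invalid responses and to relative improvement between prompts. The sketch below shows one plausible way such quantities could be computed. The validity rule (exact match against an allowed label set after trimming and lowercasing) and the relative-improvement formula are assumptions for illustration; the paper's exact definitions are not given in this excerpt.

```python
def invalid_response_pct(responses, valid_labels):
    """Percentage of responses that do not match any allowed label.

    Assumed rule: a response is valid only if, after stripping whitespace
    and lowercasing, it equals one of the allowed labels.
    """
    allowed = {label.lower() for label in valid_labels}
    invalid = sum(1 for r in responses if r.strip().lower() not in allowed)
    return 100.0 * invalid / len(responses)


def relative_improvement_pct(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent (assumed formula)."""
    return 100.0 * (improved - baseline) / baseline


# Hypothetical model outputs for a yes/no depression prompt.
responses = ["Yes", "no", "I think the poster may be depressed", " yes "]
invalid_pct = invalid_response_pct(responses, ["yes", "no"])  # one of four is invalid

# Hypothetical balanced-accuracy scores under two prompts.
gain = relative_improvement_pct(0.50, 0.60)  # roughly a 20% relative gain
```

Under this strict rule, a free-text answer counts as invalid even when it implies a label, which is one way a less structured prompt could lower measured scores.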