A Comprehensive Evaluation of Large Language Models on Mental Illnesses in Arabic Context

Noureldin Zahran; Aya E. Fouda; Radwa J. Hanafy; Mohammed E. Fouda

TL;DR

This work evaluates eight diverse LLMs on Arabic mental health tasks across native and translated datasets to understand cross-lingual performance and prompting effects. It demonstrates that prompt design, especially structured prompts, strongly influences diagnostic accuracy, with multi-class tasks being particularly sensitive. Phi-3.5 MoE and Mistral NeMo emerge as top performers for balanced accuracy and mean absolute error, respectively, while few-shot prompting yields substantial gains, notably for GPT-4o Mini. The findings highlight translation biases, dataset difficulty, and the value of targeted prompting strategies for culturally aware, Arabic-speaking mental health diagnostics, while outlining avenues for improved data quality and model adaptation.

Abstract

Mental health disorders pose a growing public health concern in the Arab world, underscoring the need for accessible diagnostic and intervention tools. Large language models (LLMs) offer a promising approach, but their application in Arabic contexts faces challenges including limited labeled datasets, linguistic complexity, and translation biases. This study comprehensively evaluates eight LLMs, including general multilingual models as well as bilingual ones, on diverse mental health datasets (such as AraDepSu, Dreaddit, MedMCQA), investigating the impact of prompt design, language configuration (native Arabic vs. translated English, and vice versa), and few-shot prompting on diagnostic performance. We find that prompt engineering significantly influences LLM scores, mainly due to reduced instruction following: our structured prompt outperformed a less structured variant on multi-class datasets by an average difference of 14.5%. While the influence of language on performance was modest, model selection proved crucial: Phi-3.5 MoE excelled in balanced accuracy, particularly for binary classification, while Mistral NeMo achieved the best mean absolute error on severity prediction tasks. Few-shot prompting consistently improved performance, with particularly substantial gains for GPT-4o Mini on multi-class classification, where accuracy rose by an average factor of 1.58. These findings underscore the importance of prompt optimization, multilingual analysis, and few-shot learning for developing culturally sensitive and effective LLM-based mental health tools for Arabic-speaking populations.
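The abstract reports results in terms of balanced accuracy (for classification) and mean absolute error (for severity prediction). As a minimal sketch of what these two metrics measure, the snippet below uses their standard textbook definitions; the function names and the toy labels are hypothetical and are not taken from the paper, whose exact evaluation code may differ.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    total = defaultdict(int)    # number of examples per true class
    correct = defaultdict(int)  # number of correctly predicted examples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between predicted and true severity levels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data (not from the paper).
labels = [1, 1, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 0, 1]
# Recall is 3/4 for class 1 and 1/2 for class 0, so the mean is 0.625.
print(balanced_accuracy(labels, preds))

severity_true = [0, 1, 2, 3]
severity_pred = [0, 2, 2, 1]
# Absolute errors are 0, 1, 0, 2, so the mean is 0.75.
print(mean_absolute_error(severity_true, severity_pred))
```

Balanced accuracy is a natural choice when classes are imbalanced, since a model that always predicts the majority class scores only 1/k on a k-class task.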

Paper Structure (40 sections, 5 equations, 15 figures, 12 tables)

Introduction
Data Collection
Native Arabic Datasets
Depression Corpus of Arabic Tweets (DCAT)
Modern Standard Arabic Mood Changing and Depression Dataset (MCD)
ARADEPSU
CAIRODEP
Twitter-based Arabic Mental Illness (AMI)
Mental Disorder Egyptian Arabic Dialect (MDE)
Translated English Datasets
DREADDIT [turcan2019dreaddit]
SDCNL [haque2021deep]
SAD [mauriello2021sad]
DEPTWEET [kabir2023deptweet]
RED SAM [sampath2022data, kayalvizhi2022findings]
...and 25 more sections

Figures (15)

Figure 1: Example of a formatted prompt (ZS-2) for binary depression classification.
Figure 2: Example of a formatted prompt (ZS-2) for a binary classification task on depression.
Figure 3: Percentage of invalid responses for each dataset-model trial using the ZS-1 Prompt. Columns with a maximum invalid response percentage below 5% were omitted for clarity.
Figure 4: Percentage of invalid responses for each dataset-model trial using the ZS-2 Prompt. Columns with a maximum invalid response percentage below 5% were omitted for clarity.
Figure 5: Comparison of performance of models on BA for prompts ZS-1 and ZS-2 across all datasets, measuring relative improvement percentage.
...and 10 more figures
