An Assessment on Comprehending Mental Health through Large Language Models
Mihael Arcan, David-Paul Niland, Fionn Delahunty
TL;DR
This study evaluates whether large language models can understand expressions of mental health by comparing Llama-2 and ChatGPT against classical ML methods and transformer baselines on the DAIC-WOZ dataset. It finds that transformer-based models, such as Distil-RoBERTa and XLNet, generally outperform the LLMs on GAD/PHQ-4 prediction tasks, with Distil-RoBERTa yielding the strongest weighted metrics overall. The findings suggest that task-specific transformers are currently more effective than general-purpose LLMs for mental health assessment, while also underscoring the importance of bias analysis and of accounting for the dynamic nature of mental health in future work.
Abstract
Mental health challenges pose considerable global burdens on individuals and communities. Recent data indicate that more than 20% of adults may encounter at least one mental disorder in their lifetime. Advances in large language models have facilitated diverse applications, yet a significant research gap persists in understanding and enhancing the potential of these models within the domain of mental health. In particular, an outstanding question is whether large language models can comprehend expressions of human mental health conditions in natural language. This study presents an initial evaluation of large language models in addressing this gap. To this end, we compare the performance of Llama-2 and ChatGPT with classical machine learning and deep learning models. Our results on the DAIC-WOZ dataset show that transformer-based models, such as BERT and XLNet, outperform the large language models.
