Evaluating Large Language Models for Anxiety, Depression, and Stress Detection: Insights into Prompting Strategies and Synthetic Data

Mihael Arcan; David-Paul Niland

TL;DR

This study systematically compares transformer-based and large language model approaches for detecting anxiety, depression, and stress from text, using the DAIC-WOZ and Stress Detection datasets. It evaluates fine-tuned transformers, LLM prompting with Llama and GPT-3.5 Turbo, and synthetic data augmentation across PHQ-2, GAD-2, and PHQ-4 tasks, highlighting Distil-RoBERTa as a strong performer and revealing nuanced benefits and pitfalls of synthetic data. Key findings show that prompting strategy and model choice markedly influence recall and precision, with zero-shot synthetic prompts improving stress detection but sometimes reducing precision in depression tasks. The results inform practical design choices for automated mental health assessment, emphasizing careful calibration and task-tailored prompting to balance performance and generalization.

Abstract

Mental health disorders affect over one-fifth of adults globally, yet detecting such conditions from text remains challenging due to the subtle and varied nature of symptom expression. This study evaluates multiple approaches for mental health detection, comparing Large Language Models (LLMs) such as Llama and GPT with classical machine learning and transformer-based architectures including BERT, XLNet, and Distil-RoBERTa. Using the DAIC-WOZ dataset of clinical interviews, we fine-tuned models for anxiety, depression, and stress classification and applied synthetic data generation to mitigate class imbalance. Results show that Distil-RoBERTa achieved the highest F1 score (0.883) for GAD-2, while XLNet outperformed others on PHQ tasks (F1 up to 0.891). For stress detection, a zero-shot synthetic approach (SD+Zero-Shot-Basic) reached an F1 of 0.884 and ROC AUC of 0.886. Findings demonstrate the effectiveness of transformer-based models and highlight the value of synthetic data in improving recall and generalization. However, careful calibration is required to prevent precision loss. Overall, this work emphasizes the potential of combining advanced language models and data augmentation to enhance automated mental health assessment from text.
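The trade-off noted above, where synthetic data can raise recall at some cost in precision, can be made concrete with a short sketch of how precision, recall, and F1 follow from confusion counts. The counts below are hypothetical and chosen only for illustration; they are not results from the study, and the function name is an assumption introduced here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts. Returns 0.0 for any undefined ratio."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (NOT from the paper): a classifier before and after
# augmentation, evaluated on the same 50 positive cases.
baseline = precision_recall_f1(tp=30, fp=5, fn=20)    # misses many positives
augmented = precision_recall_f1(tp=45, fp=20, fn=5)   # finds more, flags more

for name, (p, r, f1) in [("baseline", baseline), ("augmented", augmented)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this illustrative case, recall rises and F1 improves even though precision falls, which is why a single F1 figure is best read alongside its precision and recall components.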
