Building Multilingual Datasets for Predicting Mental Health Severity through LLMs: Prospects and Challenges

Konstantinos Skianis, John Pavlopoulos, A. Seza Doğruöz

TL;DR

This study addresses the gap in multilingual mental health assessment by translating English social-media datasets into six languages and evaluating LLMs on depression-severity and suicide-risk classification. It applies a translation-and-prompting pipeline with GPT-3.5, GPT-4o-mini, and Llama-3.1 to predict severity on the Dep-Severity and C-SSRS datasets, and finds substantial performance variability across languages. The analyses cover zero-shot and few-shot prompting, back-translation, and cross-language comparisons; the reported cost of the GPT-3.5 experiments is under $30, which speaks to the scalability of the approach. The authors take a cautious stance on fully automated mental health diagnosis, advocate human-in-the-loop oversight, and propose extending the work to more languages and tasks while examining translation-induced biases and resource limitations.

Abstract

Large Language Models (LLMs) are increasingly being integrated into various medical fields, including mental health support systems. However, there is a gap in research regarding the effectiveness of LLMs in non-English mental health support applications. To address this problem, we present a novel multilingual adaptation of widely used mental health datasets, translated from English into six languages (Greek, Turkish, French, Portuguese, German, and Finnish). This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages. By experimenting with GPT and Llama, we observe considerable variability in performance across languages, even though the models are evaluated on translations of the same dataset. This inconsistency underscores the complexities inherent in multilingual mental health support, where language-specific nuances and mental health data coverage can affect the accuracy of the models. Through comprehensive error analysis, we emphasize the risks of relying exclusively on LLMs in medical settings (e.g., their potential to contribute to misdiagnoses). Moreover, our proposed approach offers significant cost savings for multilingual tasks, presenting a major advantage for broad-scale implementation.
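
The translate-then-prompt idea described above can be sketched in a few lines. This is only an illustrative outline, not the authors' implementation: the label set, the prompt wording, and the `translate` and `classify` callables are all assumptions standing in for whatever translation step and LLM client are actually used.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical label set for illustration; the datasets define their own classes.
LABELS = ["minimal", "mild", "moderate", "severe"]


def build_prompt(post: str, labels: Sequence[str], language: str,
                 examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a zero-shot (no examples) or few-shot severity prompt."""
    parts = [
        f"The following social-media post is written in {language}.",
        "Classify its severity as one of: " + ", ".join(labels) + ".",
        "Answer with the label only.",
    ]
    # Few-shot demonstrations, if any, precede the post to be classified.
    for text, label in examples:
        parts.append(f"Post: {text}\nLabel: {label}")
    parts.append(f"Post: {post}\nLabel:")
    return "\n\n".join(parts)


def parse_label(reply: str, labels: Sequence[str]) -> Optional[str]:
    """Map a free-text model reply to a known label, or None if none matches."""
    lowered = reply.strip().lower()
    # Try longer labels first so a short label cannot shadow a longer one.
    for label in sorted(labels, key=len, reverse=True):
        if label.lower() in lowered:
            return label
    return None


def predict(posts: Sequence[str], labels: Sequence[str], language: str,
            translate: Callable[[str, str], str],
            classify: Callable[[str], str],
            examples: Sequence[Tuple[str, str]] = ()) -> List[Optional[str]]:
    """Translate each English post, prompt the model, and parse its answer."""
    results = []
    for post in posts:
        translated = translate(post, language)  # assumed: (text, target language) -> text
        reply = classify(build_prompt(translated, labels, language, examples))
        results.append(parse_label(reply, labels))
    return results
```

Passing `translate` and `classify` as plain callables keeps the sketch independent of any particular API; replies that match no label come back as `None`, so they can be counted as errors rather than silently assigned to a class.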

Paper Structure

This paper contains 28 sections, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: An illustration of our proposed methodology.
  • Figure 2: F1 of GPT-3.5 across languages (vertically) per class (box) for Dep-Severity.
  • Figure 3: F1 of GPT-4o-mini across languages (vertically) per class (box) for C-SSRS.
  • Figure 4: An example of translation (English to Greek) from the C-SSRS dataset with GPT-4o-mini.
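
Figures 2 and 3 report F1 per class and per language. As a rough sketch of how such a breakdown could be computed, the snippet below derives per-class F1 from gold and predicted labels and repeats it for each language. The function names and the toy labels are hypothetical, and this is not the authors' evaluation code.

```python
from typing import Dict, Sequence


def per_class_f1(gold: Sequence[str], pred: Sequence[str],
                 labels: Sequence[str]) -> Dict[str, float]:
    """One-vs-rest F1 for each label: 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that never appears in gold or predictions gets 0.0 by convention here.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def f1_by_language(gold: Sequence[str],
                   preds_by_language: Dict[str, Sequence[str]],
                   labels: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Per-class F1 for every language, scored against the same gold labels."""
    return {lang: per_class_f1(gold, pred, labels)
            for lang, pred in preds_by_language.items()}
```

Because the translated posts keep the labels of their English originals, a single gold list can be shared across all languages, which is what makes the per-language scores directly comparable.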