Large Language Models for Mental Health Diagnostic Assessments: Exploring The Potential of Large Language Models for Assisting with Mental Health Diagnostic Assessments -- The Depression and Anxiety Case

Kaushik Roy, Harshul Surana, Darssan Eswaramoorthi, Yuxin Zi, Vedant Palit, Ritvik Garimella, Amit Sheth

TL;DR

This paper systematically evaluates the use of large language models to assist with mental health diagnostic assessments, focusing on PHQ-9 for MDD and GAD-7 for GAD. It compares prompting-based and fine-tuning-based approaches using proprietary models (GPT-3.5, GPT-4o) and open-source models (llama-3.1-8b, mixtral-8x7b), along with two fine-tuned models (Mentalllama and DiagnosticLlama) trained on the PRIMATE-derived ground truth. The ground-truth datasets are built from clinician-annotated PRIMATE posts, with annotator agreement measured by Cohen's kappa (0.74 for PHQ-9, 0.72 for GAD-7). The study finds that LLMs can approach human expert annotation quality, especially in few-shot settings, and introduces the DiagnosticLlama model and a suite of annotated datasets to spur further research, while highlighting remaining gaps in replicating clinician-level diagnostic reasoning. The work has practical implications for reducing clinician workload and for guiding the development of safer, instruction-tuned tools for mental health assessments, with future plans to expand to additional questionnaires and clinician-facing applications.

Abstract

Large language models (LLMs) are increasingly attracting the attention of healthcare professionals for their potential to assist in diagnostic assessments, which could alleviate the strain on the healthcare system caused by a high patient load and a shortage of providers. For LLMs to be effective in supporting diagnostic assessments, it is essential that they closely replicate the standard diagnostic procedures used by clinicians. In this paper, we specifically examine the diagnostic assessment processes described in the Patient Health Questionnaire-9 (PHQ-9) for major depressive disorder (MDD) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire for generalized anxiety disorder (GAD). We investigate various prompting and fine-tuning techniques to guide both proprietary and open-source LLMs in adhering to these processes, and we evaluate the agreement between LLM-generated diagnostic outcomes and expert-validated ground truth. For fine-tuning, we utilize the Mentalllama and Llama models, while for prompting, we experiment with proprietary models like GPT-3.5 and GPT-4o, as well as open-source models such as llama-3.1-8b and mixtral-8x7b.
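
As an illustration of what "agreement" can mean here, the sketch below computes Cohen's kappa, the statistic the TL;DR reports for the clinician annotations, between two label sequences. It is only a minimal sketch: the function name, the binary per-symptom labels, and the example data are hypothetical, and this excerpt does not state which metric the authors use to compare LLM outputs against the ground truth.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: fraction of items on which the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labeled independently according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), and returning 1.0 is a convention chosen here.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 1 = symptom present in the post, 0 = absent.
expert_labels = [1, 1, 0, 0]
model_labels = [1, 0, 0, 0]
kappa = cohens_kappa(expert_labels, model_labels)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example, "symptom absent") dominates the data.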

Paper Structure (32 sections, 1 figure, 14 tables)