Comparing the Efficacy of GPT-4 and Chat-GPT in Mental Health Care: A Blind Assessment of Large Language Models for Psychological Support

Birger Moell

TL;DR

This study conducts a blind, clinician-led comparison of GPT-4 and Chat-GPT across 18 psychological prompts to assess their suitability for mental health support. GPT-4 consistently receives higher ratings than Chat-GPT, with responses judged more clinically relevant and empathetic, though both models rate their own responses more favorably than the human evaluator does. The author discusses ethical guidelines, safety considerations, and the need for robust, multi-evaluator validation before clinical deployment. Overall, the findings suggest that GPT-4-like models could augment mental health support under strict oversight and rigorous validation, with future work needed to generalize across populations and conditions.

Abstract

Background: Rapid advancements in natural language processing have led to the development of large language models with the potential to revolutionize mental health care. These models have shown promise in assisting clinicians and providing support to individuals experiencing various psychological challenges.

Objective: This study aims to compare the performance of two large language models, GPT-4 and Chat-GPT, in responding to a set of 18 psychological prompts, to assess their potential applicability in mental health care settings.

Methods: A blind methodology was employed, with a clinical psychologist evaluating the models' responses without knowledge of their origins. The prompts encompassed a diverse range of mental health topics, including depression, anxiety, and trauma, to ensure a comprehensive assessment.

Results: The results demonstrated a significant difference in performance between the two models (p < 0.05). GPT-4 achieved an average rating of 8.29 out of 10, while Chat-GPT received an average rating of 6.52. The clinical psychologist's evaluation suggested that GPT-4 was more effective at generating clinically relevant and empathetic responses, thereby providing better support and guidance to potential users.

Conclusions: This study contributes to the growing body of literature on the applicability of large language models in mental health care settings. The findings underscore the importance of continued research and development in the field to optimize these models for clinical use. Further investigation is necessary to understand the specific factors underlying the performance differences between the two models and to explore their generalizability across various populations and mental health conditions.
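
The abstract reports a significant difference but does not name the statistical test used. As an illustration only, the sketch below shows one way paired per-prompt ratings from a single rater could be compared: an exact two-sided sign-flip permutation test. The function name `paired_permutation_p` and the example ratings are hypothetical and are not the study's data or method.

```python
from itertools import product


def paired_permutation_p(ratings_a, ratings_b):
    """Exact two-sided sign-flip permutation test for paired ratings.

    Under the null hypothesis that the two models are rated alike,
    the sign of each per-prompt difference is arbitrary. The p-value
    is the share of all sign assignments whose absolute summed
    difference is at least as large as the observed one.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("ratings must be paired, one pair per prompt")
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every sign assignment: 2**n cases, feasible for small n.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float ratings
            extreme += 1
    return extreme / total


# Hypothetical ratings on a 1-10 scale, NOT taken from the paper.
model_a = [8, 9, 7, 9, 8, 9]
model_b = [6, 7, 7, 6, 5, 8]
print(paired_permutation_p(model_a, model_b))
```

With 18 prompts the enumeration covers 2**18 sign assignments, which is still quick to run. A permutation test is used here because it makes no distributional assumption about ordinal ratings; a paired t-test or Wilcoxon signed-rank test would be alternative choices.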

Paper Structure

This paper contains 17 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Average rating of psychological advice generated by GPT-4 / Chat-GPT.
  • Figure 2: Rating for each response generated by GPT-4 / Chat-GPT.
  • Figure 3: Comparison of human rating and self-ratings for GPT-4 / Chat-GPT.