Are Large Language Models Possible to Conduct Cognitive Behavioral Therapy?

Hao Shen, Zihan Li, Minqiang Yang, Minghui Ni, Yongfeng Tao, Zhengyang Yu, Weihao Zheng, Chen Xu, Bin Hu

TL;DR

The paper investigates whether large language models can feasibly conduct Cognitive Behavioral Therapy by building an automatic evaluation framework that assesses emotion tendency, structured dialogue patterns, and proactive questioning using a real CBT dialogue corpus. It evaluates four representative LLMs under single-turn and multi-turn settings, and further augments models with a CBT knowledge base to study retrieval-augmented effects. Results show that LLMs have potential for CBT, with ChatGPT typically performing best on many metrics, especially after knowledge-base integration, while multi-turn performance varies by model. The work highlights practical implications for scalable psychological support and underscores privacy, data coverage, and evaluation challenges that must be addressed for real-world deployment.

Abstract

In contemporary society, psychological health has become an increasingly prominent issue, characterized by the diversification, complexity, and universality of mental disorders. Cognitive Behavioral Therapy (CBT), currently the most influential and clinically effective psychological treatment method with no side effects, has limited coverage and poor quality in most countries. In recent years, research on recognizing and intervening in emotional disorders with large language models (LLMs) has been validated, opening new possibilities for psychological assistance therapy. However, can LLMs truly conduct cognitive behavioral therapy? Mental health experts have raised many concerns about using LLMs for therapy. To answer this question, we collected a real CBT corpus from online video websites and designed and ran a targeted automatic evaluation framework covering the emotion tendency of generated text, structured dialogue patterns, and proactive inquiry ability. For emotion tendency, we calculate the emotion tendency score of the CBT dialogue text generated by each model. For structured dialogue patterns, we use a diverse range of automatic evaluation metrics to compare speaking style, the ability to maintain topic consistency, and the use of technology in CBT across models. For inquiry that guides the patient, we use the PQA (Proactive Questioning Ability) metric. We also evaluated the CBT ability of the LLMs after integrating a CBT knowledge base, to explore how much additional knowledge improves a model's CBT counseling ability. Four LLM variants with excellent performance on natural language processing are evaluated, and the experimental results show the great potential of LLMs in the psychological counseling realm, especially when combined with other technological means.

Paper Structure (9 sections, 7 figures, 16 tables)

Figures (7)

  • Figure 1: The overview of our automatic computational evaluation framework. We constructed two types of LLM-based CBT therapists and assessed their CBT counseling ability under single-turn conversations and multi-turn conversations.
  • Figure 2: Cloud maps of the reference text and the text generated by general large language models.
  • Figure 3: Emotion Scores of the CBT text generated by different general large language models. Blue bars represent the emotion score distribution under single-turn conversation, and the red bars represent the emotion score distribution under multi-turn conversation.
  • Figure 4: Radar charts of normalized evaluation metrics of general large language models.
  • Figure 5: Cloud maps of the reference text and the text generated by knowledge base integrated large language models.
  • ...and 2 more figures