Multimodal Cognitive Reframing Therapy via Multi-hop Psychotherapeutic Reasoning

Subin Kim, Hoonrae Kim, Heejin Do, Gary Geunbae Lee

TL;DR

AI-driven cognitive reframing has struggled to use non-verbal cues. This work extends cognitive reframing to multimodal therapy by introducing M2CoSC, a dataset that pairs GPT-4-generated dialogues with images of the client's face. It also introduces a multi-hop psychotherapeutic reasoning method that grounds interventions in implicit evidence such as facial expressions, thoughts, and cognitive distortions. Experiments show that vision-language models trained on M2CoSC with multi-hop reasoning give more empathetic and more coherent guidance than text-only baselines, as judged by both GPT-4 and human evaluators. The work highlights the practical potential of multimodal AI therapists and outlines directions for broadening the use of non-verbal cues in AI-assisted psychotherapy.

Abstract

Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, it has focused primarily on text-based methods, often overlooking the non-verbal evidence that is crucial in real-life therapy. To address this gap, we extend textual cognitive reframing to a multimodal setting by incorporating visual cues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client's facial expressions. To better mirror real psychotherapy, where facial expressions guide the interpretation of implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates such subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs' performance as psychotherapists improves significantly with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.

Paper Structure

This paper contains 32 sections, 14 figures, 10 tables, 1 algorithm.

Figures (14)

  • Figure 1: Illustration of multimodal conversational cognitive reframing. The therapist uses both verbal and non-verbal information to assess the client's state and then provides appropriate interventions.
  • Figure 2: An example illustrating the construction of the M2CoSC dataset. Left: the prompt given to GPT-4 in the client role; Right: the prompt given to GPT-4 Vision in the therapist role. GPT-4 Vision is given a client's face image. [dialogue history] denotes a history of conversations accumulated during role play.
  • Figure 3: Comparison of standard prompting and multi-hop psychotherapeutic reasoning. The multi-hop approach integrates the client's emotional and cognitive state (facial expressions, thoughts, and cognitive distortions) at each step of the intervention. The conversation on the left shows the therapist’s replies, which correspond to the four stages—Introduction, Guidance, Brainstorming, and Suggestion—outlined on the right.
  • Figure 4: Dialogue-level win rates assessed by GPT-4. Detailed numerical results are provided in Appendix \ref{sec:gpt_wr_aisim}.
  • Figure 5: Stage-wise win rates assessed by GPT-4 at each stage of the M2CoSC benchmark. Numerical results are provided in Appendix \ref{sec:gpt_wr_csconv}.
  • ...and 9 more figures