
MIRROR: Multimodal Cognitive Reframing Therapy for Rolling with Resistance

Subin Kim, Hoonrae Kim, Jihyun Lee, Yejin Jeon, Gary Geunbae Lee

TL;DR

This work tackles the challenge of client resistance in AI-assisted CBT by introducing Mirror, a synthetic multimodal dataset that pairs client statements with facial expressions to train vision-language models for emotion-aware therapy. The authors design a three-step data pipeline—multimodal dialogue design, counseling screenplay generation, and facial expression synthesis—augmented by six quality filters and two safety filters, plus a reasoning framework of planning and emotional captioning. Evaluations show that vision-augmented models outperform text-only baselines in therapist skills and therapeutic alliance under resistance, with planning and emotional captioning yielding the strongest results and domain experts endorsing Mirror_P+EC. Real-world demonstrations on motivational interviewing videos suggest practical applicability while highlighting limitations around privacy, bias, and evaluation; the work also provides a publicly released dataset and supporting code to propel multimodal CBT research. Overall, Mirror advances multimodal, resistance-aware AI therapists and sets a foundation for safer, more empathic AI-assisted psychotherapy, though careful consideration of ethics, diverse representation, and clinical validation remains essential.

Abstract

Recent studies have explored the use of large language models (LLMs) in psychotherapy; however, text-based cognitive behavioral therapy (CBT) models often struggle with client resistance, which can weaken the therapeutic alliance. To address this, we propose a multimodal approach that incorporates nonverbal cues, allowing the AI therapist to better align its responses with the client's negative emotional state. Specifically, we introduce Mirror (Multimodal Interactive Rolling with Resistance), a novel synthetic dataset that pairs each client statement with a corresponding facial image. Using this dataset, we train baseline vision-language models (VLMs) to analyze facial cues, infer emotions, and generate empathetic responses that effectively manage client resistance. These models are then evaluated on both their counseling skills as a therapist and the strength of the therapeutic alliance in the presence of client resistance. Our results demonstrate that Mirror significantly enhances the AI therapist's ability to handle resistance, outperforming existing text-based CBT approaches. Human expert evaluations further confirm the effectiveness of our approach in managing client resistance and fostering therapeutic alliance.

Paper Structure

This paper contains 60 sections, 18 figures, and 7 tables.

Figures (18)

  • Figure 1: Text-based therapists have limitations in interpreting nonverbal cues, as they cannot perceive behaviors such as sighs or posture shifts, which can lead to premature problem-solving rather than addressing deeper emotions.
  • Figure 2: Overview of the Mirror dataset construction. The pipeline consists of three main stages: Multimodal Dialogue Design, Counseling Screenplay Generation, and Facial Expression Synthesis.
  • Figure 3: Overview of the planning process.
  • Figure 4: Overview of emotional captioning. The AI therapist infers the client’s emotional state from facial cues and uses it to generate an empathetic, aligned response.
  • Figure 5: Pairwise comparison results among Mirror-LLaVA, Camel-LLaMA3, and LLaMA-3-8B on three evaluation criteria (Goal, Approach, and Affective Bond), rated by two psychotherapists.
  • ...and 13 more figures