Mitigating Semantic Drift: Evaluating LLMs' Efficacy in Psychotherapy through MI Dialogue Summarization
Vivek Kumar, Pushpraj Singh Rajawat, Eirini Ntoutsi
TL;DR
This paper tackles semantic drift in applying large language models to psychotherapy by introducing MITI-based annotation for motivational interviewing (MI) dialogues and the AnnoSUM-MI dataset. It adopts a two-stage annotation framework that combines expert labels with LLM-generated summaries, evaluated via multi-output scoring on six MITI dimensions under one-shot and few-shot prompting with three models (ChatGPT, Gemini, DeepSeek). Key contributions include a MITI-grounded annotation scheme, a publicly available annotated MI dataset, and an evaluation framework focused on contextual fidelity rather than traditional accuracy metrics. Findings indicate that ChatGPT most closely tracks expert judgments, while Gemini underperforms, highlighting the need for careful prompting and independent validation in sensitive, low-resource domains.
Abstract
Recent advancements in large language models (LLMs) have shown their potential across both general and domain-specific tasks. However, there is growing concern about their lack of sensitivity, factually incorrect responses, inconsistent expressions of empathy, bias, hallucinations, and overall inability to capture the depth and complexity of human understanding, especially in low-resource and sensitive domains such as psychology. To address these challenges, our study employs a mixed-methods approach to evaluate the efficacy of LLMs in psychotherapy. We use LLMs to generate precise summaries of motivational interviewing (MI) dialogues and design a two-stage annotation scheme based on key components of the Motivational Interviewing Treatment Integrity (MITI) framework, namely evocation, collaboration, autonomy, direction, empathy, and a non-judgmental attitude. Using expert-annotated MI dialogues as ground truth, we formulate multi-class classification tasks to assess model performance under progressive prompting techniques, namely one-shot and few-shot prompting. Our results offer insights into LLMs' capacity for understanding complex psychological constructs and highlight best practices to mitigate "semantic drift" in therapeutic settings. Our work contributes to the MI community by providing a high-quality annotated dataset that addresses data scarcity in low-resource domains, and it offers critical insights into using LLMs for precise contextual interpretation in complex behavioral therapy.
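As a rough illustration of the evaluation idea described above, the following Python sketch compares model-assigned labels with expert labels separately for each of the six MITI dimensions. It is a minimal sketch under stated assumptions, not the paper's actual scoring procedure: the integer label scale, the dictionary layout, and the function name `per_dimension_agreement` are all hypothetical, and exact-match agreement is used only as a simple stand-in for whatever metric the authors report.

```python
from typing import Dict, List

# The six MITI components named in the abstract (key spellings are assumed).
MITI_DIMENSIONS = [
    "evocation",
    "collaboration",
    "autonomy",
    "direction",
    "empathy",
    "non_judgmental_attitude",
]


def per_dimension_agreement(
    expert: List[Dict[str, int]],
    model: List[Dict[str, int]],
) -> Dict[str, float]:
    """Return, per MITI dimension, the fraction of dialogues on which the
    model's label exactly matches the expert's label.

    Each list element holds the labels for one dialogue summary, keyed by
    dimension name. The integer label scale is an assumption.
    """
    if len(expert) != len(model):
        raise ValueError("expert and model must cover the same dialogues")
    if not expert:
        raise ValueError("at least one dialogue is required")

    agreement: Dict[str, float] = {}
    for dim in MITI_DIMENSIONS:
        # Count dialogues where both annotations agree on this dimension.
        matches = sum(1 for e, m in zip(expert, model) if e[dim] == m[dim])
        agreement[dim] = matches / len(expert)
    return agreement


if __name__ == "__main__":
    # Two made-up dialogues; the values are illustrative, not real data.
    expert_labels = [
        {dim: 3 for dim in MITI_DIMENSIONS},
        {dim: 4 for dim in MITI_DIMENSIONS},
    ]
    model_labels = [
        {dim: 3 for dim in MITI_DIMENSIONS},
        {**{dim: 4 for dim in MITI_DIMENSIONS}, "empathy": 2},
    ]
    for dim, score in per_dimension_agreement(expert_labels, model_labels).items():
        print(f"{dim}: {score:.2f}")
```

Reporting agreement per dimension rather than as a single pooled number makes it visible when a model tracks experts on some constructs (for instance, direction) but diverges on others (for instance, empathy), which is the kind of drift the study is concerned with.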
