Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations

Mohit Chandra, Siddharth Sriraman, Harneet Singh Khanuja, Yiqiao Jin, Munmun De Choudhury

TL;DR

This work introduces MedAgent, a framework for generating realistic, multi-turn mental health sensemaking conversations, and the MHSD dataset with 2,284 synthetic dialogues. It also presents MultiSenseEval, a holistic evaluation framework that assesses patient-centric communication, conversational flow, diagnostic accuracy, and readability, validated through automated metrics and human evaluation. The experiments show frontier reasoning models underperform on patient-centric metrics and exact diagnosis, with performance influenced by patient persona and decreasing as conversation length grows, underscoring the challenges of sustained, meaningful mental health interactions with LLMs. The authors provide synthetic data, an evaluation platform, and insights that inform safer, more effective development of LLMs in high-stakes healthcare contexts.

Abstract

Limited access to mental healthcare, extended wait times, and the increasing capabilities of Large Language Models (LLMs) have led individuals to turn to LLMs to fulfill their mental health needs. However, the multi-turn mental health conversation capabilities of LLMs remain under-explored. Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates and often overlook alignment with the patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations, and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient-LLM conversations. Additionally, we present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings using human-centric criteria. Our findings reveal that frontier reasoning models yield below-par performance on patient-centric communication and struggle with advanced diagnostic capabilities, with an average score of 31%. Additionally, we observed variation in model performance based on the patient's persona and a performance drop as the number of turns in the conversation increases. Our work provides a comprehensive synthetic data generation framework, a dataset, and an evaluation framework for assessing LLMs in multi-turn mental health conversations.

Paper Structure

This paper contains 29 sections, 5 figures, 30 tables, 2 algorithms.

Figures (5)

  • Figure 1: We present the MedAgent framework for generating realistic multi-turn mental health sensemaking conversations (part (a)). Using this framework we create the MHSD dataset with 2,284 conversations. Finally, we also present the MultiSenseEval framework (part (b)) to holistically evaluate LLM performance across patient-centric communication, conversational flow and correctness, diagnostic accuracy, and readability.
  • Figure 2: Sample conversation between the sensemaker and a patient with high conscientiousness (traits listed by first letter) and basic medical literacy. Stages are distinguished by color, with some intermediate dialogues skipped for conciseness. Diagnosis is in bold.
  • Figure 3: Performance comparison of OpenAI o1 and DeepSeek-R1 across Perceived Susceptibility, Perceived Severity, Perceived Benefits, and Conversation Flow and Correctness. Bars indicate mean scores with 95% confidence intervals. All scores are on a 4-point Likert scale ((1): Very Poor to (4): Very Good). Both models score below the "Good Performance" rating on the three patient-centric communication metrics, but exceed the "Good Performance" threshold for Conversation Flow and Correctness.
  • Figure 4: Performance comparison between OpenAI o1 and DeepSeek-R1 across Hard Diagnostic Accuracy and Soft Diagnostic Accuracy. Bars indicate mean scores with 95% confidence intervals. The performance of both models drops by more than 50% when the diagnosis is matched exactly with the ground truth ("Hard Accuracy") compared to when it is matched on broader/general criteria ("Soft Accuracy").
  • Figure 5: Performance trend for o1 and R1 across the MultiSenseEval framework metrics. The x-axis indicates sensemaker message count bins. While, on average, performance on patient-centric metrics and diagnostic accuracy declined with longer conversations, conversation flow and correctness and readability improved.
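
Figures 3 and 4 report mean scores with 95% confidence intervals. As a minimal sketch of how such a bar and its error bar could be computed from a list of 4-point Likert ratings, the snippet below uses a normal approximation (mean ± 1.96 standard errors). This is an assumption for illustration: the source does not state which interval method the authors used, and the function name `mean_with_ci95` and the sample ratings are hypothetical.

```python
import math
from statistics import mean, stdev


def mean_with_ci95(scores):
    """Return (mean, half_width) of a 95% confidence interval.

    Assumption: normal approximation (z = 1.96) with the sample
    standard deviation; the paper's actual method is not specified.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate spread")
    m = mean(scores)
    # Standard error of the mean, scaled by the 95% z-value.
    half_width = 1.96 * stdev(scores) / math.sqrt(n)
    return m, half_width


# Hypothetical ratings on the 4-point scale (1: Very Poor to 4: Very Good).
ratings = [3, 4, 2, 3, 3, 4, 2, 3]
avg, hw = mean_with_ci95(ratings)
print(f"mean = {avg:.2f}, 95% CI = [{avg - hw:.2f}, {avg + hw:.2f}]")
```

For small samples, a t-distribution or bootstrap interval would be wider and may be the more appropriate choice.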