Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use

Mohit Chandra, Siddharth Sriraman, Gaurav Verma, Harneet Singh Khanuja, Jose Suarez Campayo, Zihang Li, Michael L. Birnbaum, Munmun De Choudhury

TL;DR

This work introduces the Psych-ADR benchmark and the ADRA framework to systematically evaluate how large language models (LLMs) detect adverse drug reactions (ADRs) related to psychiatric medications and respond with expert-aligned harm-reduction guidance. Across 239 Reddit posts, LLMs show meaningful but incomplete capability: ADR detection reaches roughly 77% accuracy in the best cases, yet many models misclassify ADR types and exhibit risk-averse biases. In alignment tasks, LLMs fall short on readability and on actionable, physician-aligned harm-reduction strategies (HRS), reaching at most 70.86% agreement with experts on HRS and generally lower actionability than clinician responses. The study highlights the need to incorporate lived experience and domain-specific alignment into high-risk AI systems, and provides a benchmark and framework to drive future improvements in medical dialogue AI.

Abstract

Adverse Drug Reactions (ADRs) from psychiatric medications are the leading cause of hospitalizations among mental health patients. With healthcare systems and online communities facing limitations in resolving ADR-related issues, Large Language Models (LLMs) have the potential to fill this gap. Despite the increasing capabilities of LLMs, past research has not explored their capabilities in detecting ADRs related to psychiatric medications or in providing effective harm reduction strategies. To address this, we introduce the Psych-ADR benchmark and the Adverse Drug Reaction Response Assessment (ADRA) framework to systematically evaluate LLM performance in detecting ADR expressions and delivering expert-aligned mitigation strategies. Our analyses show that LLMs struggle with understanding the nuances of ADRs and differentiating between types of ADRs. While LLMs align with experts in terms of expressed emotions and tone of the text, their responses are more complex, harder to read, and only 70.86% aligned with expert strategies. Furthermore, they provide less actionable advice by a margin of 12.32% on average. Our work provides a comprehensive benchmark and evaluation framework for assessing LLMs in strategy-driven tasks within high-risk domains.

Paper Structure

This paper contains 27 sections, 1 equation, 8 figures, 16 tables.

Figures (8)

  • Figure 1: Overview of work; we present two tasks in this work -- ADR detection and multiclass classification (RQ1), and Expert-LLM response alignment (RQ2).
  • Figure 2: Mean KL divergence between the distribution of Empath categories in model responses and in expert responses in the Psych-ADR benchmark dataset (lower is better).
  • Figure 3: Mean SMOG Scores and 95% Confidence Intervals for Various Models (lower values are better).
  • Figure 4: Annotation interface used for the Psych-ADR benchmark. The interface displays the post title and content, along with access to the annotation guidelines. In the left screenshot, the annotator identifies an adverse drug reaction (ADR) related to psychiatric medication, then provides a brief rationale and selects the class of ADR. In the right screenshot, the annotator indicates that no ADR is present, in which case only a rationale for this decision is required.
  • Figure 5: Sample answer representing the structure of answers provided in the Psych-ADR benchmark dataset.
  • ...and 3 more figures
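
The KL divergence comparison named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes each response has already been reduced to a dictionary mapping category names to weights. The function name `kl_divergence`, the smoothing constant `eps`, and the direction of the divergence (first argument against second) are assumptions made for illustration; the listing above does not state how the authors computed it.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """Return KL(p || q) for two category-weight dictionaries.

    Assumption: additive smoothing with `eps` keeps the result finite
    when a category is missing from one side. The paper's own
    smoothing choice is not given in the text above.
    """
    # Use the union of categories so both vectors line up.
    categories = sorted(set(p) | set(q))
    p_vals = [p.get(c, 0.0) + eps for c in categories]
    q_vals = [q.get(c, 0.0) + eps for c in categories]

    # Renormalize so each side sums to 1 after smoothing.
    p_sum, q_sum = sum(p_vals), sum(q_vals)
    p_norm = [v / p_sum for v in p_vals]
    q_norm = [v / q_sum for v in q_vals]

    # Natural-log KL divergence, so the result is in nats.
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p_norm, q_norm))


# Hypothetical category weights, not taken from the dataset.
expert = {"health": 0.5, "medical_emergency": 0.3, "positive_emotion": 0.2}
model = {"health": 0.6, "medical_emergency": 0.4}
score = kl_divergence(model, expert)
```

Averaging such a score over all posts would give one mean value per model, matching the "lower is better" reading of the caption.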