WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions

Seyedali Mohammadi, Edward Raff, Jinendra Malekar, Vedant Palit, Francis Ferraro, Manas Gaur

TL;DR

WellDunn examines the robustness and explainability of language models in identifying Halbert Dunn's Wellness Dimensions in mental health posts. It evaluates models on two domain-grounded datasets, MultiWD and WellXplain, with a pipeline that uses SVD-based attention analysis and an Attention-Overlap Score to measure how well model explanations align with expert ground-truth cues. The study finds that general-purpose models often outperform domain-specific ones, that abstention via Gambler's Loss can reduce performance in some settings, and that attention explanations align poorly with expert cues across both LMs and LLMs. These results point to the need for better integration of domain knowledge and careful validation before clinical deployment, and to future work on human-AI collaboration and retrieval-augmented strategies to improve reliability and explainability.

Abstract

Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a sufficient litmus test of a model's utility in clinical practice. A model that can be trusted for practice should have a correspondence between explanation and clinical determination, yet no prior research has examined the attention fidelity of these models and their effect on ground truth explanations. We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WDs). We focus on two existing mental health and well-being datasets: (a) the multi-label classification-based MultiWD, and (b) WellXplain, for evaluating attention mechanism veracity against expert-labeled explanations. The labels are based on Halbert Dunn's theory of wellness, which gives grounding to our evaluation. We reveal four surprising results about LMs/LLMs: (1) Despite their human-like capabilities, GPT-3.5/4 lag behind RoBERTa, and MedAlpaca, an LLM fine-tuned on WellXplain, fails to deliver any remarkable improvements in performance or explanations. (2) Re-examining LMs' predictions based on a confidence-oriented loss function reveals a significant performance drop. (3) Across all LMs/LLMs, the alignment between attention and explanations remains low, with LLMs scoring a dismal 0.0. (4) Most mental health-specific LMs/LLMs overlook domain-specific knowledge and undervalue explanations, causing these discrepancies. This study highlights the need for further research into the consistency and explanations of these models in mental health and well-being.
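The confidence-oriented loss in result (2) is Gambler's Loss, which lets a classifier abstain by adding one extra "abstention" output alongside the class outputs. The sketch below shows the standard single-label form of that loss, `-log(p_target + p_abstain / payoff)`, as a minimal illustration. The function names, the example logits, and the payoff value are assumptions made for illustration, and the multi-label variant used in the paper may differ from this form.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gamblers_loss(logits, target, payoff):
    """Single-label Gambler's Loss for one example (illustrative sketch).

    logits: m class scores followed by one abstention score (length m + 1).
    target: index of the true class, in [0, m).
    payoff: reward factor o with 1 < o <= m; smaller values make
            abstaining cheaper, so the model abstains more often.
    """
    num_classes = len(logits) - 1
    if not 0 <= target < num_classes:
        raise ValueError("target must index one of the m real classes")
    if not 1.0 < payoff <= num_classes:
        raise ValueError("payoff must satisfy 1 < payoff <= m")
    probs = softmax(logits)
    # The abstention mass is discounted by the payoff before it is
    # credited toward the true class.
    return -math.log(probs[target] + probs[-1] / payoff)


# Four wellness classes plus one abstention slot (hypothetical logits).
uniform = gamblers_loss([0.0, 0.0, 0.0, 0.0, 0.0], target=0, payoff=2.0)
confident_wrong = gamblers_loss([0.0, 8.0, 0.0, 0.0, 0.0], target=0, payoff=2.0)
abstaining = gamblers_loss([0.0, 0.0, 0.0, 0.0, 8.0], target=0, payoff=2.0)
```

In this sketch, placing mass on the abstention slot costs far less than a confident wrong prediction, which is the mechanism that lets a model trained this way decline to predict on uncertain posts.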

Paper Structure (21 sections, 1 equation, 8 figures, 16 tables)

Figures (8)

  • Figure 1: Motivating Example from WellXplain dataset. Expert annotators categorize user posts into four WD classes and justify their choice by highlighting pertinent parts of the text. In LM or LLM classification tasks, the goal is to identify one of the labels (1: Physical, 2: Intellectual and Vocational, 3: Social, 4: Spiritual and Emotional) based solely on relevant cues in the post. The cues are the explanations.
  • Figure 2: WellDunn workflow: MultiWD task (L) and WellXplain task (R). The architecture includes shared steps: (1) fine-tuning of general-purpose and domain-specific LMs to extract data representations, followed by (2) feeding them into a feed-forward neural network (FFNN) classifier. Two loss functions assess LMs' robustness: Sigmoid Cross-Entropy (SCE) and Gambler's Loss (GL). Singular Value Decomposition (SVD) and the Attention-Overlap (AO) Score assess explainability. In: Input; Out: Output. WellDunn Benchmarking Box: the middle rectangle highlights the components of the benchmark system, which comprise (1) fine-tuning and (2) the FFNN classifier, together with the Robustness and Explainability components. The left and right dotted rectangles group the components for the MultiWD and WellXplain tasks, respectively. For Task 1, the input (a text post) is fed into the MultiWD task, and the model outputs a prediction over WDs such as PA, IA, and SA. For Task 2, the input (a text post) is fed into the WellXplain task, which outputs a prediction along with corresponding explanations. Note that during instruction (training) we provide both input and output, whereas during evaluation (test) we provide only the input.
  • Figure 3: Merging of WDs in MultiWD. The expert annotators suggest merging WDs based on their experience and the literature [bart2018assessment].
  • Figure 4: Bar plots illustrating the predicted probabilities from the ERNIE LM fine-tuned on MultiWD. These outcomes offer a visual perspective on the two posts, revealing the contrast between GL and SCE across the 6-, 5-, and 4-dimension (D) settings. Notably, for post 2, the ERNIE model with GL abstains from making a prediction. Note that the highlighted posts are obtained from SCE with 4-D. The highlighted posts for GL with 4-D, along with further posts, appear in the supplementary figures (Appendix D).
  • Figure 5: Implementation details: Structure of the LLaMA model used for fine-tuning.
  • ...and 3 more figures
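
Figures 1 and 2 describe comparing model attention against the text spans that expert annotators highlight as explanations. The exact definition of the Attention-Overlap (AO) Score is not given in this summary, so the sketch below uses a hypothetical stand-in: the fraction of the top-k most attended tokens that also appear among the expert-highlighted tokens. The function name, the choice of k, and the example post are all assumptions for illustration only.

```python
def attention_overlap(tokens, attention, expert_tokens, k=3):
    """Hypothetical overlap score between attention and expert cues.

    tokens:        list of token strings from one post.
    attention:     one attention weight per token (same length as tokens).
    expert_tokens: tokens highlighted by the expert annotator.
    k:             how many of the most attended tokens to inspect.

    Returns the share of the top-k attended tokens that the expert also
    highlighted, a value in [0.0, 1.0].
    """
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank token positions from most to least attended.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    top = {tokens[i].lower() for i in ranked[:k]}
    expert = {t.lower() for t in expert_tokens}
    if not top:
        return 0.0
    return len(top & expert) / len(top)


# Hypothetical post with an expert highlight on the physical-wellness cues.
post = ["I", "cannot", "sleep", "and", "feel", "exhausted"]
expert = ["sleep", "exhausted"]

aligned = attention_overlap(post, [0.05, 0.10, 0.40, 0.05, 0.10, 0.30], expert, k=2)
misaligned = attention_overlap(post, [0.45, 0.05, 0.02, 0.40, 0.05, 0.03], expert, k=2)
```

Under this stand-in definition, a model that attends mostly to function words such as "I" and "and" scores 0.0 even when its predicted label is correct, which is the kind of gap between prediction and explanation the paper reports.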