From "Help" to Helpful: A Hierarchical Assessment of LLMs in Mental e-Health Applications

Philipp Steigerwald; Jens Albrecht

TL;DR

This paper investigates the generation of six-word German subject lines for psychosocial counselling emails using eleven LLMs. Outputs are evaluated by nine assessors through a hierarchical assessment: categorisation (Good/Fair/Poor) followed by ranking within each category. The results show a clear performance split: OpenAI's GPT-4o and GPT-3.5 Turbo receive about 73% Good ratings, while German-tuned open-source models (e.g., SauerkrautLM Mixtral 8×7b Q8) reach up to 54% Good, underscoring the value of language adaptation and model size. The study introduces a multi-metric framework using Krippendorff's $\alpha$, Spearman's $\rho$, Kendall's $\tau$ and Pearson's $r$ to assess reliability, correlation and convergence between the categorical and ranking analyses, and it highlights the privacy implications of cloud-based models compared with local, privacy-preserving open-source options. The findings support cautious, ethical deployment of AI-assisted subject line generation in e-mental health, emphasising human-in-the-loop evaluation, data privacy and the potential for targeted fine-tuning and task-specific improvements to narrow the gap to proprietary systems.

Abstract

Psychosocial online counselling frequently encounters generic subject lines that impede efficient case prioritisation. This study evaluates eleven large language models that generate six-word subject lines for German counselling emails, using a hierarchical assessment: outputs are first categorised and then ranked within categories, which keeps the evaluation manageable. Ratings from nine assessors (counselling professionals and AI systems) are analysed with Krippendorff's $\alpha$, Spearman's $\rho$, Pearson's $r$ and Kendall's $\tau$. Results reveal performance trade-offs between proprietary services and privacy-preserving open-source alternatives, with German fine-tuning consistently improving performance. The study addresses critical ethical considerations for mental health AI deployment, including privacy, bias and accountability.

Paper Structure (25 sections, 4 equations, 9 figures, 1 table)


Introduction
Related Work
Text Summarisation and Ultra-Short Summary Generation
LLMs in Mental Health Applications
Evaluation Methodologies for LLM Outputs
Methodology
Study Design
Email Thread Construction
LLM Selection
Subject Line Generation
Assessor Configuration
Statistical Analysis Framework
Results
Data Filtering and Reliability Enhancement
Categorical Rating Analysis
...and 10 more sections

Figures (9)

Figure 1: Evaluation workflow showing the progression from 23 counselling email threads through LLM processing (11 models generating 253 subject lines), assessment by nine evaluators (five human counselling professionals and four AI systems producing 2,277 assessments), to final filtered assessments (1,233) after applying inter-rater reliability thresholds.
Figure 2: Hierarchical categorisation example for a counselling email about self-harm concerns, demonstrating quality distinctions between Good (specific), Fair (partial) and Poor (generic) subject lines (adapted from steigerwald_comparing_2025).
Figure 3: Relationship between Krippendorff's $\alpha$ (blue), remaining data ratio (green) and minimum agreement threshold. The retained data proportion decreases as agreement thresholds rise, causing a corresponding increase in Krippendorff's $\alpha$ values. The red marker indicates where $\alpha$ exceeds 0.667 at 57% agreement, achieving $\alpha$ = 0.70 whilst retaining 54.2% of the data.
Figure 4: Distribution of ratings before (solid bars, n = 2,277) and after filtering (hatched bars, n = 1,233). Fair ratings showed the strongest reduction (56.7%), followed by Good (42.4%) and Poor ratings (36.8%).
Figure 5: Distribution of filtered ratings (Good, Fair and Poor) across all evaluated models. Model names are abbreviated as follows: LM3.1 (Llama 3.1), LM3 (Llama 3) and 8×7b (Mixtral 8×7b). The suffixes Q4 and Q8 denote 4-bit and 8-bit quantisation, respectively.
...and 4 more figures
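
The counts in the Figure 1 caption and the agreement-based filtering described for Figure 3 can be illustrated with a minimal sketch. The captions do not state how per-item agreement is computed, so the definition below (share of assessors choosing the most common category) is an assumption, as are the function names `modal_agreement` and `filter_by_agreement` and the toy ratings. Only the numbers 23, 11, 9, 1,233 and the 57% threshold come from the captions.

```python
from collections import Counter


def modal_agreement(ratings):
    """Share of assessors who chose the most common category for one item.

    Assumption: this is one plausible agreement measure, not necessarily
    the one used in the paper.
    """
    counts = Counter(ratings)
    return counts.most_common(1)[0][1] / len(ratings)


def filter_by_agreement(items, threshold=0.57):
    """Keep only the items whose agreement reaches the threshold."""
    return {key: r for key, r in items.items() if modal_agreement(r) >= threshold}


# Counts reported in the Figure 1 caption.
threads, models, assessors = 23, 11, 9
subject_lines = threads * models          # one subject line per thread and model
assessments = subject_lines * assessors   # one rating per subject line and assessor
retained_share = 1233 / assessments       # share of assessments kept after filtering

# Hypothetical toy ratings from nine assessors for two subject lines.
toy = {
    "line_a": ["Good"] * 7 + ["Fair"] * 2,                # 7 of 9 agree
    "line_b": ["Good"] * 4 + ["Fair"] * 3 + ["Poor"] * 2,  # 4 of 9 agree
}
kept = filter_by_agreement(toy)
```

Under this assumed definition, a 57% threshold with nine assessors requires at least six matching ratings per subject line, since five of nine is only about 55.6%.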
