From "Help" to Helpful: A Hierarchical Assessment of LLMs in Mental e-Health Applications
Philipp Steigerwald, Jens Albrecht
TL;DR
This paper investigates the generation of six-word German subject lines from psychosocial counselling emails using eleven LLMs, evaluated by nine assessors through a hierarchical assessment: categorisation (Good/Fair/Poor) followed by ranking within each category. It demonstrates a clear performance split: OpenAI's GPT-4o and GPT-3.5 Turbo achieve about 73% Good, while German-tuned open-source models (e.g., SauerkrautLM Mixtral 8×7b Q8) reach up to 54% Good, underscoring the value of language adaptation and model size. The study introduces a multi-metric framework that uses Krippendorff's $\alpha$, Spearman's $\rho$, Kendall's $\tau$ and Pearson's $r$ to assess reliability, correlation and convergence between the categorical and ranking analyses, and it highlights the privacy implications of cloud-based models compared with local, privacy-preserving open-source options. The findings support cautious, ethical deployment of AI-assisted subject line generation in e-mental health, emphasising human-in-the-loop evaluation, data privacy, and the potential for targeted fine-tuning and task-specific improvements to narrow the gap to proprietary systems.
Abstract
Psychosocial online counselling frequently encounters generic subject lines that impede efficient case prioritisation. This study evaluates eleven large language models that generate six-word subject lines for German counselling emails. The evaluation is hierarchical: outputs are first categorised and then ranked within categories, which keeps the assessment manageable. Ratings from nine assessors (counselling professionals and AI systems) are analysed using Krippendorff's $\alpha$, Spearman's $\rho$, Pearson's $r$ and Kendall's $\tau$. Results reveal performance trade-offs between proprietary services and privacy-preserving open-source alternatives, with German fine-tuning consistently improving performance. The study addresses critical ethical considerations for mental health AI deployment, including privacy, bias and accountability.
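To make the hierarchical assessment concrete, the following minimal Python sketch shows one plausible way to flatten a single assessor's judgement (category first, then rank within category) into one overall ranking and to compare two assessors with Kendall's $\tau$. It is an illustration under stated assumptions, not the authors' implementation: the model names, the example ratings, the helper functions `overall_ranking` and `kendall_tau`, and the choice of the tie-free tau-a variant are all hypothetical.

```python
from itertools import combinations

# Category order from best to worst, using the labels named in the text.
CATEGORY_ORDER = {"Good": 0, "Fair": 1, "Poor": 2}


def overall_ranking(assessment):
    """Flatten one assessor's hierarchical judgement into a single ranking.

    `assessment` maps a model name to (category, rank_within_category),
    where rank 1 is the best output inside that category.
    """
    return sorted(
        assessment,
        key=lambda m: (CATEGORY_ORDER[assessment[m][0]], assessment[m][1]),
    )


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau-a between two tie-free rankings of the same items."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings order the pair the same way.
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ratings from two assessors for four placeholder models.
assessor_1 = {
    "model_a": ("Good", 1),
    "model_b": ("Good", 2),
    "model_c": ("Fair", 1),
    "model_d": ("Poor", 1),
}
assessor_2 = {
    "model_a": ("Good", 1),
    "model_c": ("Good", 2),
    "model_b": ("Fair", 1),
    "model_d": ("Poor", 1),
}

ranking_1 = overall_ranking(assessor_1)
ranking_2 = overall_ranking(assessor_2)
tau = kendall_tau(ranking_1, ranking_2)
print(ranking_1, ranking_2, round(tau, 3))
```

In this toy example the two assessors disagree only on the order of `model_b` and `model_c`, so five of the six model pairs are concordant. Sorting by category before within-category rank means that an assessor only has to order the few outputs that share a category, which is the sense in which the hierarchical scheme keeps evaluation manageable.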
