From "Help" to Helpful: A Hierarchical Assessment of LLMs in Mental e-Health Applications

Philipp Steigerwald, Jens Albrecht

TL;DR

This paper investigates the generation of six-word German subject lines from psychosocial counselling emails using eleven LLMs, evaluated through a hierarchical assessment that combines categorisation (Good/Fair/Poor) with within-category ranking by nine assessors. The results show a clear performance split: OpenAI's GPT-4o and GPT-3.5 Turbo achieve about 73% Good, while German-tuned open-source models (e.g., SauerkrautLM Mixtral 8×7b Q8) reach up to 54% Good, underscoring the value of language adaptation and model size. The study applies a multi-metric framework (Krippendorff's $\alpha$, Spearman's $\rho$, Kendall's $\tau$ and Pearson's $r$) to assess reliability, correlation and convergence between the categorical and ranking analyses, and it highlights the privacy implications of cloud-based models compared with local, privacy-preserving open-source options. The findings support cautious, ethical deployment of AI-assisted subject line generation in e-mental health, emphasising human-in-the-loop evaluation, data privacy and the potential for targeted fine-tuning and task-specific improvements to narrow the gap to proprietary systems.

Abstract

Psychosocial online counselling frequently encounters generic subject lines that impede efficient case prioritisation. This study evaluates eleven large language models that generate six-word subject lines for German counselling emails. It uses a hierarchical assessment that first categorises outputs and then ranks them within categories, which keeps the evaluation manageable. Ratings from nine assessors (counselling professionals and AI systems) are analysed via Krippendorff's $\alpha$, Spearman's $\rho$, Pearson's $r$ and Kendall's $\tau$. Results reveal performance trade-offs between proprietary services and privacy-preserving open-source alternatives, with German fine-tuning consistently improving performance. The study addresses critical ethical considerations for mental health AI deployment, including privacy, bias and accountability.
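
The two-stage idea described in the abstract (categorise first, then rank within each category) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Assessment` record, the model labels and the `overall_order` helper are hypothetical names introduced only for illustration, and the only detail taken from the source is the Good/Fair/Poor category scheme.

```python
from dataclasses import dataclass

# Category order assumed from the Good/Fair/Poor scheme named in the TL;DR.
CATEGORY_ORDER = {"Good": 0, "Fair": 1, "Poor": 2}


@dataclass(frozen=True)
class Assessment:
    """One assessor's judgement of one subject line (hypothetical record)."""
    model: str             # hypothetical model label
    category: str          # "Good", "Fair" or "Poor"
    rank_in_category: int  # 1 = best within its category


def overall_order(assessments):
    """Sort by category first, then by rank inside the category."""
    return sorted(
        assessments,
        key=lambda a: (CATEGORY_ORDER[a.category], a.rank_in_category),
    )


# Toy data, invented for illustration only.
example = [
    Assessment("model_c", "Poor", 1),
    Assessment("model_a", "Good", 2),
    Assessment("model_b", "Fair", 1),
    Assessment("model_d", "Good", 1),
]

ordered = [a.model for a in overall_order(example)]
print(ordered)  # ['model_d', 'model_a', 'model_b', 'model_c']
```

The practical benefit of this design is that an assessor only has to compare items that share a category, rather than rank all eleven outputs against each other at once.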

Paper Structure (25 sections, 4 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Evaluation workflow showing the progression from 23 counselling email threads through LLM processing (11 models generating 253 subject lines), assessment by nine evaluators (five human counselling professionals and four AI systems producing 2,277 assessments), to final filtered assessments (1,233) after applying inter-rater reliability thresholds.
  • Figure 2: Hierarchical categorisation example for a counselling email about self-harm concerns, demonstrating quality distinctions between Good (specific), Fair (partial) and Poor (generic) subject lines (adapted from steigerwald_comparing_2025).
  • Figure 3: Relationship between Krippendorff's $\alpha$ (blue), remaining data ratio (green) and minimum agreement threshold. The retained data proportion decreases as agreement thresholds rise, causing a corresponding increase in Krippendorff's $\alpha$ values. The red marker indicates where $\alpha$ exceeds 0.667 at 57% agreement, achieving $\alpha = 0.70$ whilst retaining 54.2% of the data.
  • Figure 4: Distribution of ratings before (solid bars, n = 2,277) and after filtering (hatched bars, n = 1,233). Fair ratings showed the strongest reduction (56.7%), followed by Good (42.4%) and Poor ratings (36.8%).
  • Figure 5: Distribution of filtered ratings (Good, Fair and Poor) across all evaluated models. Model names are abbreviated as follows: LM3.1 (Llama 3.1), LM3 (Llama 3) and 8×7b (Mixtral 8×7b). The suffixes Q4 and Q8 denote 4-bit and 8-bit quantisation, respectively.
  • ...and 4 more figures
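
The counts in the Figure 1 caption and the threshold filtering in the Figure 3 caption can be checked with a short Python sketch. The arithmetic uses only numbers stated in the captions. The filtering part is an assumption: the source does not say how "agreement" is computed, so the sketch takes it to be the share of assessors who chose the most common category for a subject line. The helpers `modal_agreement` and `filter_by_agreement` and the toy ratings are hypothetical.

```python
from collections import Counter

# Counts stated in the Figure 1 caption.
threads, models, assessors = 23, 11, 9
subject_lines = threads * models          # 253 subject lines
assessments = subject_lines * assessors   # 2,277 assessments
retained_ratio = 1233 / assessments       # share kept after filtering

print(subject_lines, assessments, round(retained_ratio * 100, 1))
# 253 2277 54.2


def modal_agreement(ratings):
    """Share of assessors who chose the most common category (assumed definition)."""
    counts = Counter(ratings)
    return counts.most_common(1)[0][1] / len(ratings)


def filter_by_agreement(items, threshold):
    """Keep subject lines whose modal agreement is at least `threshold`."""
    return {key: r for key, r in items.items() if modal_agreement(r) >= threshold}


# Toy ratings from nine assessors, invented for illustration only.
toy = {
    "line_1": ["Good"] * 7 + ["Fair"] * 2,  # 7/9, about 0.78
    "line_2": ["Good"] * 5 + ["Fair"] * 4,  # 5/9, about 0.56
    "line_3": ["Poor"] * 6 + ["Fair"] * 3,  # 6/9, about 0.67
}

kept = filter_by_agreement(toy, threshold=0.57)
print(sorted(kept))  # ['line_1', 'line_3']
```

Under this assumed definition, a 57% threshold with nine assessors is equivalent to requiring at least six matching ratings, since five of nine is roughly 55.6%.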