A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG

Arshia Kermani; Veronica Perez-Rosas; Vangelis Metsis

TL;DR

The paper systematically evaluates three LLM deployment strategies for mental health text analysis: fine-tuning with LoRA, prompt engineering (zero-shot and few-shot), and retrieval-augmented generation (RAG). The evaluation uses LLaMA 3-8B on the DAIR-AI Emotion and SWMH datasets. Fine-tuning yields the highest accuracy (up to 91% for emotion classification and 80% for mental health condition detection) but requires substantial computational resources, whereas prompt engineering and RAG offer more deployment flexibility with moderate performance. Zero-shot prompting emerges as a viable alternative for certain tasks (e.g., 68% accuracy on SWMH), and RAG shows improvements only when retrieval quality is high. Together, these findings highlight the trade-offs among accuracy, compute, and deployment practicality. The results provide practical guidance for implementing LLM-based mental health assessment tools and suggest hybrid methods and clinical validation as avenues for balancing performance with real-world constraints.

Abstract

This study presents a systematic comparison of three approaches to analyzing mental health text with large language models (LLMs): prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. Using LLaMA 3, we evaluate these approaches on emotion classification and mental health condition detection tasks across two datasets. Fine-tuning achieves the highest accuracy (91% for emotion classification, 80% for mental health conditions) but requires substantial computational resources and large training sets, while prompt engineering and RAG offer more flexible deployment with moderate performance (40–68% accuracy). Our findings provide practical insights for implementing LLM-based solutions in mental health applications, highlighting the trade-offs between accuracy, computational requirements, and deployment flexibility.
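
To make the difference between the inference-time strategies concrete, the following is a minimal sketch of how zero-shot, few-shot, and retrieval-augmented prompts might be assembled for emotion classification. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names, and the bag-of-words cosine retriever are all hypothetical stand-ins (a real RAG pipeline would typically use dense embeddings), and the label list is the six-class set of the DAIR-AI Emotion dataset as commonly distributed.

```python
import math
from collections import Counter

# Assumption: the six classes of the DAIR-AI Emotion dataset as commonly
# distributed; the paper's exact label wording may differ.
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Hypothetical instruction text, not taken from the paper.
INSTRUCTION = "Classify the emotion of the text as one of: " + ", ".join(LABELS) + "."


def zero_shot_prompt(text):
    """Instruction and label list only, with no labeled examples."""
    return f"{INSTRUCTION}\nText: {text}\nEmotion:"


def few_shot_prompt(text, examples):
    """Same instruction, preceded by a list of (text, label) examples."""
    shots = "".join(f"Text: {t}\nEmotion: {y}\n\n" for t, y in examples)
    return f"{INSTRUCTION}\n\n{shots}Text: {text}\nEmotion:"


def _bow(text):
    """Lowercased whitespace-token counts (a toy text representation)."""
    return Counter(text.lower().split())


def _cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rag_prompt(text, corpus, k=2):
    """Few-shot prompt whose examples are the k corpus items most similar
    to the query. Toy retriever: bag-of-words cosine similarity."""
    query = _bow(text)
    ranked = sorted(corpus, key=lambda ex: _cosine(query, _bow(ex[0])), reverse=True)
    return few_shot_prompt(text, ranked[:k])
```

In this sketch the only difference between few-shot prompting and RAG is where the examples come from: a fixed list versus the items retrieved for each query. That is one way to see why RAG helps only when retrieval quality is high, since poorly matched examples give the model misleading context. Fine-tuning is not shown because it changes the model weights rather than the prompt.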
