Cognitive-Mental-LLM: Evaluating Reasoning in Large Language Models for Mental Health Prediction via Online Text

Avinash Patil, Amardeep Kour Gedhu

TL;DR

This study benchmarks structured reasoning techniques—Chain-of-Thought (CoT), Self-Consistency CoT (SC-CoT), Tree-of-Thought (ToT), and Few-Shot CoT (FS-CoT)—for mental health text classification using OpenAI's o3-mini across five Reddit-derived datasets. Results show that Few-Shot CoT reliably boosts multi-class classification (e.g., CSSRS, DepSeverity), while CoT and SC-CoT improve binary-task robustness (e.g., Dreaddit, SDCNL); however, zero-shot baselines remain competitive on imbalanced data like RedSam. Compared with traditional transformers (BERT, RoBERTa, FLAN-T5), reasoning-based prompts offer interpretability and targeted gains but do not universally outperform fine-tuned models, underscoring dataset-dependent trade-offs. The work highlights the potential of reasoning-driven LLMs for scalable mental health assessment while outlining challenges in long-text handling, class imbalance, and multi-class nuance, and it suggests hybrid approaches and automatic prompt optimization as promising directions for future research.
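As a rough illustration of how Self-Consistency CoT differs from a single CoT pass, the sketch below samples several answers for the same post and keeps the most frequent label. It is not the authors' implementation: `sample_fn`, `make_stub_sampler`, and the label strings are hypothetical stand-ins, and a real run would replace the stub with a call to the model under test.

```python
from collections import Counter
from typing import Callable, Iterable, List


def self_consistency_vote(
    sample_fn: Callable[[str], str], post: str, n_samples: int = 5
) -> str:
    """Sample n_samples labels for one post and return the majority label.

    sample_fn is a hypothetical callable that runs one reasoning path
    (for example, one CoT completion) and returns only the final label.
    """
    labels: List[str] = [sample_fn(post) for _ in range(n_samples)]
    # The most frequent label across the sampled reasoning paths wins.
    return Counter(labels).most_common(1)[0][0]


def make_stub_sampler(canned: Iterable[str]) -> Callable[[str], str]:
    """Build a fake sampler that replays canned labels (illustration only)."""
    it = iter(canned)

    def sampler(post: str) -> str:
        return next(it)

    return sampler


if __name__ == "__main__":
    stub = make_stub_sampler(
        ["stress", "no stress", "stress", "stress", "no stress"]
    )
    print(self_consistency_vote(stub, "example post text", n_samples=5))
```

Voting over an odd number of samples is a common choice for binary tasks because it avoids ties; how the paper configures sampling is not stated in this summary.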

Abstract

Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques, namely Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT), to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared with baselines such as zero-shot non-CoT prompting, fine-tuned pre-trained transformers such as BERT and Mental-RoBERTa, and fine-tuned open-source LLMs such as Mental-Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable gains on datasets like Dreaddit (+0.52% over M-LLM, +0.82% over BERT) and SDCNL (+4.67% over M-LLM, +2.17% over BERT). However, performance declines on Depression Severity and CSSRS predictions suggest dataset-specific limitations, likely due to our using a more extensive test set. Among prompting strategies, Few-shot CoT consistently outperforms others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification. It offers insights into their potential for scalable clinical applications while identifying key challenges for future improvements.
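For readers unfamiliar with the metrics named above, the following minimal sketch computes sensitivity, specificity, and balanced accuracy for a binary task using their standard definitions. The function name `binary_metrics` and the toy labels are assumptions made for illustration; they are not taken from the paper or its datasets.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Compute sensitivity, specificity, and balanced accuracy (binary case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    # Sensitivity: share of true positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of true negatives that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Balanced accuracy: mean of the two, so the majority class cannot dominate.
    balanced_accuracy = (sensitivity + specificity) / 2

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


if __name__ == "__main__":
    # Toy labels: 4 positive posts and 6 negative posts (made up).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    print(binary_metrics(y_true, y_pred))
```

Balanced accuracy is useful on imbalanced data because a classifier that always predicts the majority class scores only 0.5, regardless of how skewed the class distribution is.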

Paper Structure

This paper contains 22 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Class Distributions Across Different Mental Health Datasets
  • Figure 2: Classification accuracy comparison between reasoning strategies (CoT, SC-CoT, FS-CoT, ToT) and baseline models (BERT, RoBERTa) across five mental health datasets, showing superior performance of Few-Shot CoT on CSSRS and DepSeverity (multi-class tasks).
  • Figure 3: Macro-averaged accuracy distributions for all models across datasets, demonstrating: (1) 8-12% gains from reasoning strategies in structured datasets (SDCNL/Dreaddit), (2) Zero-shot superiority in RedSam's imbalanced classification.
  • Figure 4: Accuracy heatmap comparing reasoning strategies (columns) against transformer baselines (rows) across five mental health datasets, with darker shading indicating higher performance. Highlights CoT's strong binary classification (Dreaddit/SDCNL) vs FS-CoT's multi-class advantages (CSSRS/DepSeverity).