Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models
Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen
TL;DR
Depression detection from text is hampered by subtle linguistic cues and a lack of transparent reasoning. The authors introduce a Chain-of-Thought prompting framework that decomposes detection into four stages (emotion analysis, binary depression classification, causal reasoning, and severity assessment) to mirror clinical diagnostics and improve interpretability. The approach is evaluated on the E-DAIC dataset using Concordance Correlation Coefficient (CCC) and Mean Absolute Error (MAE), with severity scored on the PHQ-8 scale in the range $[0,24]$; it outperforms traditional prompts and enhances diagnostic granularity. The work demonstrates that structured reasoning within LLMs aligns with clinical workflows and suggests promising extensions to multimodal data for more robust mental health assessment.
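For readers unfamiliar with the two evaluation metrics, the following is a minimal sketch of how CCC and MAE are commonly computed for predicted versus reference PHQ-8 scores. It uses the standard definitions (CCC with population variances and covariance); the function names `mae` and `ccc` are illustrative, the convention for the zero-denominator case is an assumption, and nothing here is taken from the authors' evaluation code.

```python
from statistics import fmean


def mae(y_true, y_pred):
    """Mean Absolute Error between reference and predicted scores."""
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (standard definition).

    CCC = 2 * cov(t, p) / (var(t) + var(p) + (mean(t) - mean(p))^2),
    using population (divide-by-n) variance and covariance.
    """
    n = len(y_true)
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    # Assumption: if both series are constant and identical, treat agreement as perfect.
    return 2 * cov / denom if denom else 1.0


# Illustrative PHQ-8 scores in [0, 24]; these are made-up values, not E-DAIC data.
reference = [0, 12, 24]
predicted = [2, 12, 20]
print(mae(reference, predicted), ccc(reference, predicted))
```

CCC penalizes both weak correlation and systematic offset between predictions and references, which is why it is often reported alongside MAE for severity regression.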
Abstract
Depression is one of the leading causes of disability worldwide, posing a severe burden on individuals, healthcare systems, and society at large. Recent advancements in Large Language Models (LLMs) have shown promise in addressing mental health challenges, including the detection of depression through text-based analysis. However, current LLM-based methods often struggle with nuanced symptom identification and lack a transparent, step-by-step reasoning process, making it difficult to accurately classify and explain mental health conditions. To address these challenges, we propose a Chain-of-Thought Prompting approach that enhances both the performance and interpretability of LLM-based depression detection. Our method breaks down the detection process into four stages: (1) sentiment analysis, (2) binary depression classification, (3) identification of underlying causes, and (4) assessment of severity. By guiding the model through these structured reasoning steps, we improve interpretability and reduce the risk of overlooking subtle clinical indicators. We validate our method on the E-DAIC dataset, where we test multiple state-of-the-art large language models. Experimental results indicate that our Chain-of-Thought Prompting technique yields superior performance in both classification accuracy and the granularity of diagnostic insights, compared to baseline approaches.
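The four-stage structure described above could be sketched as a single staged prompt along the following lines. This is a hypothetical illustration only: the stage instructions, the function names `build_cot_prompt` and `parse_phq8`, and the `PHQ-8: <score>` output convention are assumptions made for this sketch, not the prompt wording or parsing used by the authors.

```python
import re

# Hypothetical stage instructions; the paper's actual prompt text is not given here.
# Stage 1 is called "sentiment analysis" in the abstract and "emotion analysis" in the TL;DR.
STAGES = [
    ("Emotion analysis", "Describe the emotions expressed in the transcript."),
    ("Binary depression classification", "State whether the speaker appears depressed: yes or no."),
    ("Causal reasoning", "Identify likely underlying causes mentioned in the transcript."),
    ("Severity assessment", "Estimate a PHQ-8 score as an integer from 0 to 24."),
]


def build_cot_prompt(transcript: str) -> str:
    """Assemble one prompt that walks the model through the four stages in order."""
    lines = [
        "Read the interview transcript and answer the steps in order.",
        "",
        "Transcript:",
        transcript.strip(),
        "",
    ]
    for i, (name, instruction) in enumerate(STAGES, start=1):
        lines.append(f"Step {i} ({name}): {instruction}")
    lines.append("")
    lines.append("End with a line of the form 'PHQ-8: <score>'.")
    return "\n".join(lines)


def parse_phq8(response: str):
    """Extract the final score from a model response and clamp it to [0, 24].

    Returns None when no score line is found (assumed output convention).
    """
    match = re.search(r"PHQ-8:\s*(\d+)", response)
    if match is None:
        return None
    return max(0, min(24, int(match.group(1))))


prompt = build_cot_prompt("I have not been sleeping well lately.")
print(prompt)
```

Keeping the stages in one ordered list makes it straightforward to compare against a baseline prompt that asks only for the final score, which is the kind of contrast the abstract reports.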
