CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis

Jinyuan Xu, Tian Lan, Xintao Yu, Xue He, Hezhi Zhang, Ying Wang, Pierre Magistry, Mathieu Valette, Lei Li

TL;DR

CNSocialDepress introduces the first public Chinese-language depression-risk dataset that pairs binary labels with expert-annotated six-dimensional analyses. It combines a manually curated CNSD Gold standard with an automated CNSD Silver pipeline, enabling scalable labeling and structured analysis generation for depression signals on Chinese social media. Through extensive experiments across data generation, structured summarization, and classification using multiple LLMs and baselines, the work demonstrates strong generation quality and competitive classification performance, highlighting the utility of structured psychological profiling for mental health applications in Chinese. The dataset and pipeline offer practical tools for early detection and intervention while acknowledging platform biases, annotation costs, and ethical considerations.

Abstract

Depression is a pressing global public health issue, yet publicly available Chinese-language resources for risk detection remain scarce and are mostly limited to binary classification. To address this limitation, we release CNSocialDepress, a benchmark dataset for depression risk detection from Chinese social media posts. The dataset contains 44,178 texts from 233 users, in which psychological experts annotated 10,306 depression-related segments. CNSocialDepress provides binary risk labels together with structured multi-dimensional psychological attributes, enabling interpretable and fine-grained analysis of depressive signals. Experimental results demonstrate its utility across a wide range of NLP tasks, including structured psychological profiling and fine-tuning of large language models for depression detection. Comprehensive evaluations highlight the dataset's effectiveness and practical value for depression risk identification and psychological analysis, thereby offering insights for mental health applications tailored to Chinese-speaking populations.

Paper Structure

This paper contains 28 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Dataset construction process. For annotation, we used a subset of the original SWDD dataset comprising 116 positive (depressive) users and 117 negative users. Psychologists drew on DSM-5-based rating scales and on statistics of the dataset’s text to formulate an initial labeling guideline. During labeling, random samples were repeatedly drawn for cross-validation, and the labeling standards were updated based on the annotation outcomes. The resulting gold-standard data pairs each user with a six-dimensional structured analysis summary.
  • Figure 2: Example entry from the CNSD-Gold dataset.
  • Figure 3: Module II Prompt.