Chinese MentalBERT: Domain-Adaptive Pre-training on Social Media for Chinese Mental Health Text Analysis

Wei Zhai, Hongzhi Qi, Qing Zhao, Jianqiang Li, Ziqi Wang, Han Wang, Bing Xiang Yang, Guanghui Fu

TL;DR

This work introduces Chinese MentalBERT, a domain-adaptive pre-trained language model tailored for Chinese mental health text analysis on social media. It extends Chinese-BERT-wwm-ext with a depression lexicon-guided masking mechanism and trains on a large, curated Weibo-based corpus to emphasize domain-specific vocabulary. Empirical results across six mental health tasks show consistent improvements over eight baselines, with qualitative analyses illustrating more psychologically relevant predictions when using guided masking. The model and code are publicly available, offering a practical tool for Chinese mental health NLP while acknowledging privacy constraints on the pretraining data and highlighting limitations and avenues for future clinical data integration.

Abstract

Psychological issues are widespread, and social media serves as a key outlet for individuals to share their feelings. This generates vast quantities of data daily, in which negative emotions have the potential to precipitate crisis situations, so there is a recognized need for models capable of analyzing such text efficiently. While pre-trained language models have demonstrated their effectiveness broadly, there is a noticeable gap in pre-trained models tailored for specialized domains like psychology. To address this, we collected a large dataset from Chinese social media platforms and enriched it with publicly available datasets to create a corpus of 3.36 million text entries. To enhance the model's applicability to psychological text analysis, we integrated psychological lexicons into the pre-training masking mechanism. Building on an existing Chinese language model, we performed adaptive training to develop a model specialized for the psychological domain. We evaluated our model on six public datasets, where it demonstrated improvements over eight other models. Additionally, in a qualitative comparison experiment, our model provided psychologically relevant predictions for masked sentences. Due to data privacy concerns, the dataset will not be made publicly available. However, we have made the pre-trained models and code publicly accessible to the community at: https://github.com/zwzzzQAQ/Chinese-MentalBERT.
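
The lexicon-guided masking idea mentioned above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name lexicon_guided_mask, the 15% default masking ratio, the toy lexicon, and the assumption that the input is already segmented into whole words are all hypothetical choices made for illustration. The sketch fills the masking budget with lexicon words first and only then with randomly chosen other words; it also omits BERT's usual 80/10/10 replacement rule for brevity.

```python
import random

MASK = "[MASK]"


def lexicon_guided_mask(tokens, lexicon, mask_ratio=0.15, seed=None):
    """Mask a fraction of tokens, preferring those found in the lexicon.

    tokens  -- list of pre-segmented words (assumed whole-word units)
    lexicon -- set of domain words that should be masked first
    Returns (masked_tokens, sorted_masked_positions).
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Number of positions to mask; always mask at least one token.
    budget = max(1, round(len(tokens) * mask_ratio))

    lexicon_positions = [i for i, t in enumerate(tokens) if t in lexicon]
    other_positions = [i for i, t in enumerate(tokens) if t not in lexicon]
    rng.shuffle(lexicon_positions)
    rng.shuffle(other_positions)

    # Lexicon hits come first, so they use up the budget before any
    # ordinary word is considered.
    chosen = set((lexicon_positions + other_positions)[:budget])
    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    return masked, sorted(chosen)


# Toy example: "insomnia" (失眠) and "despair" (绝望) stand in for
# entries of a depression lexicon.
tokens = ["我", "最近", "总是", "失眠", "感到", "绝望", "不想", "出门"]
toy_lexicon = {"失眠", "绝望"}
masked, positions = lexicon_guided_mask(tokens, toy_lexicon, mask_ratio=0.25, seed=0)
print(masked)
print(positions)
```

With eight words and a ratio of 0.25 the budget is two positions, which the two lexicon words fill completely, so the model would be asked to predict exactly the domain-specific terms. When a sentence contains no lexicon words, the sketch falls back to ordinary random masking.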

Paper Structure (23 sections, 1 figure, 6 tables)

Figures (1)

  • Figure 1: Overview of the domain-adaptive pretraining process. The process starts from the basic pretrained language model (Chinese-BERT-wwm-ext), which is then further pretrained on 3.36 million mental health posts and comments sourced from social media. The pretraining phase uses knowledge from a depression lexicon to guide the masking process.