Mental-Perceiver: Audio-Textual Multi-Modal Learning for Estimating Mental Disorders
Jinghui Qin, Changsong Liu, Tianchi Tang, Dahuang Liu, Minghao Wang, Qianying Huang, Rumin Zhang
TL;DR
This work addresses automatic anxiety and depression detection in Mandarin-speaking adolescents, a population that has been underexplored at scale. It introduces MMPsy, a large-scale multi-modal corpus of audio recordings and transcripts paired with self-reported evaluations from standardized psychological questionnaires, and proposes Mental-Perceiver, a fully attentional, category-prior-guided network for audio-text fusion. The model achieves state-of-the-art results on MMPsy and the DAIC-WOZ benchmark, with ablations showing gains from multimodal inputs and semantic priors. The contributions establish a new Mandarin-language benchmark and a scalable approach to mental-health assessment, with potential applications in private self-screening and clinical support.
Abstract
Mental disorders, such as anxiety and depression, have become a global concern that affects people of all ages. Early detection and treatment are crucial to mitigate the negative effects these disorders can have on daily life. Although AI-based detection methods show promise, progress is hindered by the lack of publicly available large-scale datasets. To address this, we introduce the Multi-Modal Psychological assessment corpus (MMPsy), a large-scale dataset containing audio recordings and transcripts from Mandarin-speaking adolescents undergoing automated anxiety/depression assessment interviews. MMPsy also includes self-reported anxiety/depression evaluations using standardized psychological questionnaires. Leveraging this dataset, we propose Mental-Perceiver, a deep learning model for estimating mental disorders from audio and textual data. Extensive experiments on MMPsy and the DAIC-WOZ dataset demonstrate the effectiveness of Mental-Perceiver in anxiety and depression detection.
