Exploring the Task-agnostic Trait of Self-supervised Learning in the Context of Detecting Mental Disorders
Rohan Kumar Gupta, Rohit Sinha
TL;DR
The paper examines whether self-supervised learning (SSL) can yield task-agnostic representations useful for detecting multiple mental disorders from audio and video data. It investigates two SSL paradigms, multi-target prediction (PASE-mod) and masked-frame prediction (AALBERT), and generates global representations by adjusting temporal context. Results show that these SSL-derived representations outperform corresponding baselines in detecting major depressive disorder (MDD) and post-traumatic stress disorder (PTSD) on the DAIC-WOZ dataset, with audio-based PASE-mod and video-based AALBERT delivering notable improvements. This suggests real-world potential for cross-disorder detection using task-agnostic SSL features. The study is limited to two disorders, which motivates exploring additional disorders and models.
Abstract
Self-supervised learning (SSL) has been investigated to generate task-agnostic representations across various domains. However, no such investigation has been conducted for detecting multiple mental disorders. The rationale for expecting a task-agnostic representation lies in the overlapping symptoms among multiple mental disorders. Consequently, the behavioural data collected for mental health assessment may carry a mix of attributes related to multiple disorders. Motivated by this, we explore a task-agnostic representation derived through SSL in the context of detecting major depressive disorder (MDD) and post-traumatic stress disorder (PTSD) using audio and video data collected during interactive sessions. This study employs SSL models trained by predicting either multiple fixed targets or masked frames. We propose a list of fixed targets to make the generated representation more efficient for detecting MDD and PTSD. Furthermore, we modify the hyper-parameters of the SSL encoder predicting fixed targets to generate global representations that capture varying temporal contexts. Both innovations yield improved detection performance for the considered mental disorders and exhibit task-agnostic traits. For the SSL model predicting masked frames, the generated global representations also exhibit task-agnostic traits.
