Test-Time Training for Depression Detection
Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore
TL;DR
The paper addresses robustness gaps in speech-based depression detection caused by distribution shifts between training and real-world testing. It proposes a test-time training (TTT) approach built on AudioMAE with a Y-shaped architecture: an initial train-time phase trains only a depression-detection head, and at test time the encoder is updated per test sample to minimize a masked-spectrogram reconstruction objective. Evaluations on CLDD and DAIC-WOZ show that AudioMAE-TTT substantially improves robustness under background noise, gender bias, and cross-dataset conditions. The gains grow with the number of TTT steps and are reported to be statistically significant. The work highlights the practical potential of test-time adaptation for reliable, in-the-wild depression screening from speech.
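The per-sample adaptation loop described above can be illustrated with a toy sketch. This is not the authors' AudioMAE implementation: it assumes a linear encoder, a linear reconstruction decoder, and a linear detection head in NumPy, masks individual input bins rather than spectrogram patches, and keeps the decoder frozen at test time. All names (`masked_recon_loss_and_grad`, `ttt_predict`, `W_enc`, `W_dec`, `w_head`) and all hyperparameter values are hypothetical.

```python
import numpy as np

def masked_recon_loss_and_grad(W_enc, W_dec, x, mask):
    """Masked reconstruction loss and its gradient w.r.t. the encoder only."""
    x_in = x * (1.0 - mask)              # hide masked bins from the encoder
    h = x_in @ W_enc                     # shared representation, shape (H,)
    recon = h @ W_dec                    # reconstruction branch, shape (D,)
    n_masked = max(mask.sum(), 1.0)      # avoid division by zero
    err = mask * (recon - x)             # score only the masked bins
    loss = float((err ** 2).sum() / n_masked)
    grad_h = (2.0 * err / n_masked) @ W_dec.T
    grad_W_enc = np.outer(x_in, grad_h)  # chain rule through h = x_in @ W_enc
    return loss, grad_W_enc

def ttt_predict(W_enc, W_dec, w_head, x, steps=10, lr=1e-2,
                mask_ratio=0.75, seed=0):
    """Adapt a copy of the encoder on one test sample, then classify it."""
    rng = np.random.default_rng(seed)
    W = W_enc.copy()                     # trained weights stay untouched
    losses = []
    for _ in range(steps):
        # Draw a fresh random mask at every TTT step (1 = masked bin).
        mask = (rng.random(x.shape[0]) < mask_ratio).astype(float)
        loss, grad = masked_recon_loss_and_grad(W, W_dec, x, mask)
        W -= lr * grad                   # gradient step on the encoder only
        losses.append(loss)
    # Detection branch: frozen head applied to the adapted encoder's output.
    logit = float((x @ W) @ w_head)
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob, W, losses
```

Because `ttt_predict` works on a copy of `W_enc`, every test sample starts from the same trained encoder, which is one common way to run per-sample test-time training; whether adapted weights are reset or carried over between samples is a design choice this sketch does not take from the source.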
Abstract
Previous work on depression detection trains and tests models on datasets collected in similar environments. In practice, however, the train and test distributions cannot be guaranteed to be identical. Distribution shifts can arise from variations in the recording environment (e.g., background noise) and in demographics (e.g., gender, age). Such distribution shifts can lead to severe performance degradation of depression detection models. In this paper, we analyze the application of test-time training (TTT) to improve the robustness of models trained for depression detection. Compared to regular testing of the models, we find that TTT can significantly improve robustness under a variety of distribution shifts introduced by: (a) background noise, (b) gender bias, and (c) data collection and curation procedures (i.e., train and test samples come from separate datasets).
