Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling
Yuanchao Li, Zixing Zhang, Jing Han, Peter Bell, Catherine Lai
TL;DR
The paper tackles data scarcity in speech-based cognitive-state classification with a semi-supervised learning framework that combines acoustic similarity (via Fréchet Audio Distance, FAD) and linguistic reasoning (via LLMs with R3 prompting) to generate high-confidence pseudo-labels. A bimodal classifier fuses audio and text features, and iterative training with as little as 30% of the labeled data achieves performance competitive with fully supervised training. Key contributions include the multi-view pseudo-labeling strategy, the use of FAD to compare unlabeled and labeled audio, and a comparison of fusion methods, with evaluation on emotion recognition and dementia detection. The approach reduces labeling costs while maintaining accuracy in clinically relevant speech classification tasks.
Abstract
The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model. Acoustically, unlabeled data are compared to labeled data using the Fréchet audio distance, calculated from embeddings generated by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels based on our proposed task-specific knowledge. High-confidence data are identified when pseudo-labels from both sources align, while mismatches are treated as low-confidence data. A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met. We evaluate our SSL framework on emotion recognition and dementia detection tasks. Experimental results demonstrate that our method achieves competitive performance compared to fully supervised learning using only 30% of the labeled data and significantly outperforms two selected baselines.
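The agreement rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`nearest_class`, `split_by_agreement`), the toy distances, and the toy LLM labels are all hypothetical, and it assumes that the acoustic pseudo-label is the class whose labeled data has the smallest precomputed distance to the utterance, which the abstract does not spell out.

```python
from typing import Dict, List, Tuple


def nearest_class(distances: Dict[str, float]) -> str:
    """Return the class with the smallest distance (assumed acoustic rule)."""
    return min(distances, key=distances.get)


def split_by_agreement(
    acoustic: Dict[str, str], linguistic: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split utterances into high- and low-confidence sets.

    An utterance is high-confidence when both views assign the same
    pseudo-label; any mismatch (or missing linguistic label) is
    treated as low-confidence.
    """
    high: Dict[str, str] = {}
    low: List[str] = []
    for utt_id, a_label in acoustic.items():
        if linguistic.get(utt_id) == a_label:
            high[utt_id] = a_label
        else:
            low.append(utt_id)
    return high, low


# Toy, made-up distances from each unlabeled utterance to each class.
distance_to_classes = {
    "utt1": {"happy": 0.8, "sad": 2.1},
    "utt2": {"happy": 1.9, "sad": 0.7},
    "utt3": {"happy": 1.2, "sad": 1.5},
}
# Toy, made-up labels standing in for LLM predictions on transcripts.
llm_labels = {"utt1": "happy", "utt2": "sad", "utt3": "sad"}

acoustic_labels = {u: nearest_class(d) for u, d in distance_to_classes.items()}
high_conf, low_conf = split_by_agreement(acoustic_labels, llm_labels)

print(high_conf)  # {'utt1': 'happy', 'utt2': 'sad'}
print(low_conf)   # ['utt3']
```

In this toy run, `utt3` is acoustically closest to "happy" but linguistically labeled "sad", so it lands in the low-confidence set that the bimodal classifier would relabel in later iterations.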
