Uncertainty Modeling in Multimodal Speech Analysis Across the Psychosis Spectrum

Morteza Rohanian, Roya M. Hüppi, Farhad Nooralahzadeh, Noemi Dannecker, Yves Pauli, Werner Surbeck, Iris Sommer, Wolfram Hinzen, Nicolas Langer, Michael Krauthammer, Philipp Homan

TL;DR

This study targets robust detection of psychosis-related speech disruptions across the psychosis continuum by incorporating both data uncertainty and model uncertainty into a multimodal analysis. It introduces an uncertainty-aware Temporal Context Fusion (TCF) model that represents each unimodal latent as a Gaussian, $h_i^M \sim \mathcal{N}(\mu_i^M, \Sigma_i^M)$, derives the acoustic and text fusion weights $w_i^A$ and $w_i^T$ from inverse variances, and optimizes a calibration-ordinality loss $L_{CO}$ so that predicted uncertainties align with prediction errors. The approach fuses traditional and deep acoustic features (eGeMAPS, DEEPSPECTRUM, wav2vec 2.0) with text embeddings from PELICAN/XLM-RoBERTa. Evaluated on 114 German-speaking participants, including early-psychosis and schizotypy groups, it achieves $ECE = 4.5 \times 10^{-2}$ and $F1 = 0.83$. Results show improved prediction accuracy and reliability across structured, semi-structured, and narrative tasks, with strong cross-context generalization, suggesting practical utility for early detection and personalized assessment within the psychosis spectrum.
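The inverse-variance weighting mentioned above can be illustrated with a minimal sketch. It is not the authors' TCF implementation: it assumes a single scalar variance per modality (the summary describes full covariances $\Sigma_i^M$), assumes the two weights are normalized to sum to one, and uses hypothetical function names.

```python
def inverse_variance_weights(var_acoustic, var_text, eps=1e-8):
    """Return (w_A, w_T): normalized inverse-variance weights.

    Assumption: each modality's uncertainty is one non-negative scalar.
    eps guards against division by zero and is an illustrative choice.
    """
    if var_acoustic < 0 or var_text < 0:
        raise ValueError("variances must be non-negative")
    inv_a = 1.0 / (var_acoustic + eps)
    inv_t = 1.0 / (var_text + eps)
    total = inv_a + inv_t
    return inv_a / total, inv_t / total


def fuse_means(mu_acoustic, mu_text, var_acoustic, var_text):
    """Combine two scalar modality means, trusting the less uncertain one more."""
    w_a, w_t = inverse_variance_weights(var_acoustic, var_text)
    return w_a * mu_acoustic + w_t * mu_text


# Example: the acoustic estimate is three times less uncertain than the text one,
# so it receives roughly three quarters of the weight.
w_a, w_t = inverse_variance_weights(1.0, 3.0)
fused = fuse_means(2.0, 6.0, 1.0, 3.0)
```

Under these assumptions, a modality whose variance grows for a given sample is automatically down-weighted for that sample, which is the behavior the summary attributes to task-dependent fusion.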

Abstract

Capturing subtle speech disruptions across the psychosis spectrum is challenging because of the inherent variability in speech patterns. This variability reflects individual differences and the fluctuating nature of symptoms in both clinical and non-clinical populations. Accounting for uncertainty in speech data is essential for predicting symptom severity and improving diagnostic precision. Speech disruptions characteristic of psychosis appear across the spectrum, including in non-clinical individuals. We develop an uncertainty-aware model integrating acoustic and linguistic features to predict symptom severity and psychosis-related traits. Quantifying uncertainty in specific modalities allows the model to address speech variability, improving prediction accuracy. We analyzed speech data from 114 participants, including 32 individuals with early psychosis and 82 with low or high schizotypy, collected through structured interviews, semi-structured autobiographical tasks, and narrative-driven interactions in German. The model improved prediction accuracy, reducing RMSE and achieving an F1-score of 83% with ECE = 4.5e-2, showing robust performance across different interaction contexts. Uncertainty estimation improved model interpretability by identifying reliability differences in speech markers such as pitch variability, fluency disruptions, and spectral instability. The model dynamically adjusted to task structures, weighting acoustic features more in structured settings and linguistic features in unstructured contexts. This approach strengthens early detection, personalized assessment, and clinical decision-making in psychosis-spectrum research.
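For readers unfamiliar with the calibration metric reported above, the following is a minimal sketch of the standard binned expected calibration error (ECE) for classification confidences. The abstract does not state how the authors bin predictions or which variant they use, so the equal-width binning, the default of 10 bins, and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |mean confidence - accuracy| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct: truthy values where the prediction was right.
    Assumption: equal-width bins over [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(hit)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy example: four predictions at 0.9 confidence (three correct)
# and one at 0.6 confidence (correct).
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.6], [1, 1, 1, 0, 1])
```

A lower value means that stated confidence tracks observed accuracy more closely; a perfectly calibrated model would score 0.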

Paper Structure

This paper contains 36 sections, 10 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Kernel density estimation (KDE) plots with mean lines for different psychological measures. The top two subplots (a, b) show PANSS Positive and Negative scores across groups (low schizotypy, high schizotypy, and patients). The bottom two subplots (c, d) show MSS and O-LIFE scores across all participants. The dashed vertical lines indicate the mean values for each group (a, b) or each subscale (c, d). The number of participants in each group is as follows: low schizotypy (n = 45), high schizotypy (n = 37), and patients (n = 32).
  • Figure 2: Comparison of model performance and fusion weights across different tasks and modalities. The top row shows the RMSE values for speech, text, and fusion models across semi-structured interviews and discourse sessions. The bottom row shows the fusion weights of speech and text modalities, highlighting the balance between modalities.
  • Figure 3: Top features across schizotypal traits and symptoms, organized by Positive, Negative, and Disorganized dimensions. Red triangles indicate features with a 95% confidence interval that does not include zero.
  • Figure 4: Heatmap of SHAP values for various LIWC features across different targets. Features are sorted by their average SHAP importance, and significant values are marked with red stars. The targets include PANSS (Positive, Negative), MSS with Cognitive-Perceptual (CP), Interpersonal (IP), and Disorganization (DO), and O-LIFE with Unusual Experiences (UE), Introvertive Anhedonia (IA), Cognitive Disorganization (CD), and Impulsive Nonconformity (IN).
  • Figure 5: Kernel density estimation (KDE) plots with mean lines for different psychological measures across subtypes. Each subplot represents a distinct MSS or O-LIFE subscale. The three groups displayed are low schizotypy, high schizotypy, and patients. Dashed vertical lines indicate the mean values for each group within each subplot. The number of participants in each group is as follows: low schizotypy (n = 45), high schizotypy (n = 37), and patients (n = 32).