A multimodal Bayesian Network for symptom-level depression and anxiety prediction from voice and speech data

Agnes Norbury, George Fairs, Alexandra L. Georgescu, Matthew M. Nour, Emilia Molimpakis, Stefano Goria

TL;DR

The authors argue that Bayesian network modelling can address several important barriers to the clinical adoption of assessment support tools, and they evaluate a model that predicts depression and anxiety symptoms from voice and speech features in large-scale datasets.

Abstract

During psychiatric assessment, clinicians observe not only what patients report, but also important nonverbal signs such as tone, speech rate, fluency, responsiveness, and body language. Weighing and integrating these different information sources is a challenging task and a good candidate for support by intelligence-driven tools; however, this has yet to be realized in the clinic. Here, we argue that several important barriers to adoption can be addressed using Bayesian network modelling. To demonstrate this, we evaluate a model for depression and anxiety symptom prediction from voice and speech features in large-scale datasets (30,135 unique speakers). Alongside performance for conditions and symptoms (depression: ROC-AUC=0.842, ECE=0.018; anxiety: ROC-AUC=0.831, ECE=0.015; core individual symptom ROC-AUC>0.74), we assess demographic fairness and investigate integration across, and redundancy between, different input modality types. We also explore clinical usefulness metrics and acceptability to mental health service users. When provided with sufficiently rich and large-scale multimodal data streams, and specified to represent common mental conditions at the symptom rather than the disorder level, such models are a principled approach for building robust assessment support tools: they provide clinically relevant outputs in a transparent and explainable format that is directly amenable to expert clinical supervision.
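
The abstract reports Expected Calibration Error (ECE) alongside ROC-AUC. As a rough illustration of what that number measures, the sketch below computes a standard equal-width binned ECE for binary predictions. The function name, the default of 10 bins, and the binning scheme are assumptions made for this example; the abstract does not state how the authors computed ECE.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width binned ECE for binary predictions (illustrative sketch).

    ECE is the weighted average, over probability bins, of the absolute gap
    between the mean predicted probability and the observed positive rate.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps every probability in [0, 1]
    # to a bin index in 0..n_bins-1 (a probability of 1.0 lands in the last bin).
    bin_ids = np.digitize(y_prob, edges[1:-1])

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        mean_predicted = y_prob[mask].mean()
        observed_rate = y_true[mask].mean()
        # mask.mean() is the fraction of all samples that fall in this bin.
        ece += mask.mean() * abs(observed_rate - mean_predicted)
    return float(ece)


# Hypothetical labels and scores, invented for this example.
labels = [0, 0, 1, 1]
scores = [0.25, 0.25, 0.75, 0.75]
toy_ece = expected_calibration_error(labels, scores, n_bins=2)
```

A perfectly calibrated model has an ECE of 0; values such as the reported 0.018 and 0.015 indicate that predicted probabilities sit close to observed case rates on average.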

Paper Structure

This paper contains 33 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Model overview. Speech activity data (reading out loud and answering a question about recent mood) is used to generate acoustic embeddings, speech timing, and linguistic feature sets (semantic embeddings and Natural Language Processing features), which are fed into relevant surrogate models to generate multiple predictions for each individual depression and anxiety symptom (for details, see Figure S1). Symptom-level predictions are passed to a Bayesian network, which specifies mapping weights of surrogate predictions to symptom severity estimates, inter-symptom relationships, and symptom severity to overall condition probabilities (simplified sketch of network architecture; for details, see the Bayesian network structure figure). Finally, condition probabilities are passed through a calibration layer to ensure meaningful output scores.
  • Figure 2: Model calibration for overall depression and anxiety status. Plots represent predicted probability score ranges vs observed positive case rate values across the full range of output condition probabilities in the held-out test set. The dotted line at $y=x$ represents the performance of a perfectly calibrated model.
  • Figure 3: Multimodal integration. Example posterior Conditional Probability Distributions (CPDs) for sleep symptom severity states (SleepIssues=0-3) for different surrogate model types (q0=lowest quartile, q3=highest quartile of predicted symptom probability categories for each model).
  • Figure 4: Intervening on network predictions. Example of direct clinician intervention in model predictions based on follow-up discussions with a patient or client, using do-operations. For the accompanying vignette, please see the main text. Insets show a toy example for a subset of Bayesian Network depression symptom nodes, illustrating the effect of isolating sleep symptom predictions from the network after the sleep symptom has been evaluated as better explained by contextual rather than mental health-related factors.
  • Figure S1: Surrogate model architecture. Architecture of the three surrogate model types. All surrogate models were feedforward neural networks. BN, batch normalization.
  • ...and 3 more figures
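
The do-operation described in the Figure 4 caption can be illustrated with a very small discrete network evaluated by brute-force enumeration. The sketch below is not the authors' model: the three-node structure (surrogate evidence and mood feeding a sleep symptom, and both symptoms feeding a depression node), the binary states, and every probability value are hypothetical and chosen only to show the mechanics. Applying `do(Sleep = 0)` replaces the sleep node's conditional distribution with a point mass, which cuts its incoming edges so that the sleep surrogate no longer influences the depression estimate.

```python
from itertools import product

# Hypothetical CPDs; all numbers are invented for illustration.
# P(Mood = 1 | mood surrogate quartile is low (0) or high (1))
P_MOOD_GIVEN_Q = {0: 0.1, 1: 0.7}
# P(Sleep = 1 | sleep surrogate, mood), keyed by (q_sleep, mood)
P_SLEEP_GIVEN = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}
# P(Depression = 1 | sleep, mood), keyed by (sleep, mood)
P_DEP_GIVEN = {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.9}


def bernoulli(p_one, value):
    """Probability that a binary variable takes `value`, given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one


def p_depression(q_sleep, q_mood, do_sleep=None):
    """P(Depression = 1) given surrogate evidence, optionally under do(Sleep = do_sleep)."""
    total = 0.0
    for sleep, mood in product((0, 1), repeat=2):
        p_mood = bernoulli(P_MOOD_GIVEN_Q[q_mood], mood)
        if do_sleep is None:
            # No intervention: sleep depends on its surrogate and on mood.
            p_sleep = bernoulli(P_SLEEP_GIVEN[(q_sleep, mood)], sleep)
        else:
            # Graph surgery: incoming edges to Sleep are cut and its value is fixed.
            p_sleep = 1.0 if sleep == do_sleep else 0.0
        total += p_mood * p_sleep * P_DEP_GIVEN[(sleep, mood)]
    return total


# High sleep-surrogate score, low mood-surrogate score.
before = p_depression(q_sleep=1, q_mood=0)
# A clinician judges the sleep problem to be contextual and sets Sleep = 0.
after = p_depression(q_sleep=1, q_mood=0, do_sleep=0)
```

In this toy setup the intervention differs from simply observing `Sleep = 0`: conditioning would also shift the belief about mood through the mood-to-sleep edge, whereas the do-operation leaves the mood estimate untouched and removes only the sleep pathway.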