Multimodal Audio-based Disease Prediction with Transformer-based Hierarchical Fusion Network

Jinjin Cai, Ruiqi Wang, Dezhong Zhao, Ziqin Yuan, Victoria McKenna, Aaron Friedman, Rachel Foot, Susan Storey, Ryan Boente, Sudip Vhaduri, Byung-Cheol Min

TL;DR

This work addresses general audio-based disease prediction by integrating multiple bio-acoustic modalities through AuD-Former, a transformer-based hierarchical fusion framework. The model jointly learns intra-modal representations and cross-modal complementarities to produce a unified multimodal representation for disease prediction, without heavy feature selection. Across five datasets and three diseases (COVID-19, Parkinson's disease, and pathological dysarthria), AuD-Former achieves state-of-the-art performance, and ablation studies and qualitative analyses support the contribution of each main component. The results suggest that modeling intra- and inter-modal dependencies together improves predictive accuracy and interpretability, offering a scalable backbone for audio-based diagnostic tasks with potential clinical impact.

Abstract

Audio-based disease prediction is emerging as a promising supplement to traditional medical diagnosis methods, facilitating early, convenient, and non-invasive disease detection and prevention. Multimodal fusion, which integrates features from various domains within or across bio-acoustic modalities, has proven effective in enhancing diagnostic performance. However, most existing methods employ unilateral fusion strategies that focus solely on either intra-modal or inter-modal fusion, which limits how fully they exploit the complementary nature of diverse acoustic feature domains and bio-acoustic modalities. Additionally, their inadequate and isolated exploration of latent dependencies within modality-specific and modality-shared spaces curtails their capacity to manage the inherent heterogeneity in multimodal data. To fill these gaps, we propose a transformer-based hierarchical fusion network designed for general multimodal audio-based disease prediction. Specifically, we integrate intra-modal and inter-modal fusion hierarchically, encoding intra-modal and inter-modal complementary correlations in the corresponding stage. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance in predicting three diseases (COVID-19, Parkinson's disease, and pathological dysarthria), showing its potential across a broad range of audio-based disease prediction tasks. Additionally, extensive ablation studies and qualitative analyses highlight the significant benefits of each main component of our model.

Paper Structure

This paper contains 23 sections, 8 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of the proposed AuD-Former framework. This illustration showcases the framework using cough, respiration, and speech modalities as example inputs; however, the framework is versatile and can accommodate a variety of bio-audio modalities. Initially, multimodal low-level acoustic features extracted from multiple bio-audio sources undergo temporal and positional embedding processes, resulting in sequences of temporal unimodal features denoted as $\overline{X}_{1, \cdots, m}$ (see Section \ref{TE}). These sequences are input into an intra-modal representation learning module composed of multiple intra-modal transformer networks. This module produces unimodal representations $\textit{UR}_{1, \cdots, m}$, which effectively capture intra-modal dependencies within each modality-specific context (see Section \ref{Intra}). Subsequently, these unimodal representations are concatenated and, along with a low-level fusion representation $\textit{FR}_L$, fed into an inter-modal representation learning module. This module constructs a high-level fusion representation $\textit{FR}_H$ that encodes latent cross-modal complementarities within a shared modality space (see Section \ref{Inter}). Finally, the high-level fusion representation $\textit{FR}_H$ passes through a prediction layer, consisting of a multi-head attention sub-layer followed by two linear sub-layers, to produce the disease prediction.
  • Figure 2: Illustration of the intra-modal transformer network for modality $m$.
  • Figure 3: Illustration of the cross-modal attention (CA) mechanism and cross-modal transformer network.
  • Figure 4: Comparative visualization of the AuD-Former with implemented baselines (IntraFusion, IntraFusion, EF-LSTM, and LF-LSTM) and ablation models (InterAtt and InterAtt).
  • Figure 5: A t-SNE visualization of the representations learned by AuD-Former within each modality-specific space, denoted as $\textit{UR}_{m}$, and within the low-level and high-level modality-shared spaces, denoted as $\textit{FR}_{L}$ and $\textit{FR}_{H}$, respectively.
  • ...and 3 more figures
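
The data flow described in the Figure 1 caption (embedding, intra-modal learning, inter-modal fusion, prediction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`embed`, `attention`, `aud_former_sketch`), the width `D = 16`, the random untrained weights, and the use of single-head attention are all illustrative assumptions. In particular, the caption does not say how $\textit{FR}_L$ is built or which input supplies the attention queries, so the sketch assumes $\textit{FR}_L$ is the concatenation of the embedded sequences and uses it as the query side.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared feature width (hypothetical value)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def embed(features, d=D):
    """Project (T, F) low-level features to width d and add sinusoidal
    positions. A plain linear projection stands in for the paper's temporal
    embedding (assumption)."""
    T, F = features.shape
    W = rng.standard_normal((F, d)) / np.sqrt(F)
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
    return features @ W + pe

def attention(query_seq, context_seq, d=D):
    """Single-head scaled dot-product attention with a residual connection.
    Passing the same sequence twice gives self-attention; passing two
    different sequences gives cross-attention."""
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = query_seq @ Wq, context_seq @ Wk, context_seq @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d))
    return query_seq + weights @ V

def aud_former_sketch(modalities, n_classes=2, d=D):
    # 1. Temporal + positional embedding: one sequence X_bar per modality.
    x_bar = [embed(x, d) for x in modalities]
    # 2. Intra-modal learning: self-attention inside each modality -> UR_m.
    ur = [attention(x, x, d) for x in x_bar]
    # 3. Concatenate the unimodal representations along the time axis.
    ur_cat = np.concatenate(ur, axis=0)
    # Assumed low-level fusion representation FR_L (not given in the caption).
    fr_l = np.concatenate(x_bar, axis=0)
    # 4. Inter-modal learning: cross-attention from FR_L to UR -> FR_H.
    fr_h = attention(fr_l, ur_cat, d)
    # 5. Prediction layer: attention sub-layer, mean pooling over time,
    #    then two linear sub-layers with a ReLU in between.
    pooled = attention(fr_h, fr_h, d).mean(axis=0)
    W1 = rng.standard_normal((d, d)) / np.sqrt(d)
    W2 = rng.standard_normal((d, n_classes)) / np.sqrt(d)
    return softmax(np.maximum(pooled @ W1, 0.0) @ W2)

# Toy inputs: cough, respiration, and speech features of differing shapes.
toy_modalities = [
    rng.standard_normal((5, 3)),
    rng.standard_normal((7, 4)),
    rng.standard_normal((6, 5)),
]
probs = aud_former_sketch(toy_modalities)
print(probs.shape)
```

Because every modality is projected to the same width `D`, sequences with different lengths and feature counts can be concatenated along the time axis before the inter-modal stage. The weights here are random, so the output probabilities are meaningless; the sketch only shows how the representations flow between stages.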