Medical Speech Symptoms Classification via Disentangled Representation

Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

TL;DR

Medical speech symptom classification faces the challenge of extracting symptom-relevant intent from both textual content and acoustic signals. The authors present DRSC, a GAN-based disentanglement framework that splits domain-invariant intent into a shared space $Z_i$ and domain-specific content into $Z_{c_T}$ and $Z_{c_M}$ for text and Mel-spectrogram, then performs cross-domain exchanges and reconstructions to obtain robust representations. The intents from text and Mel-spectrogram are fused via a feature fusion layer and fed to a classifier, with a composite objective that includes $L_{cc}$, $L_{distri}$, $L_{CE}$, $L_{KL}$, $L_{lr}$ and $L_{adv}$. On the Medical Speech, Transcription, Intent dataset, DRSC achieves an average accuracy of $95\%$ across $25$ symptoms and shows robustness to inaccurate transcripts, outperforming traditional SpeechIC baselines. This approach offers a scalable, robust method for multimodal medical symptom diagnosis from speech signals.
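As a reading aid, one common way to combine terms like these is a weighted sum. The weights $\lambda_k$ and the name $L_{total}$ below are assumptions for illustration only; this summary does not state how the six terms are weighted or combined.

$$
L_{total} = \lambda_{cc} L_{cc} + \lambda_{distri} L_{distri} + \lambda_{CE} L_{CE} + \lambda_{KL} L_{KL} + \lambda_{lr} L_{lr} + \lambda_{adv} L_{adv}
$$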

Abstract

Intent is defined for understanding spoken language in existing works. Both textual features and acoustic features involved in medical speech contain intent, which is important for symptomatic diagnosis. In this paper, we propose a medical speech classification model named DRSC that automatically learns to disentangle intent and content representations from textual-acoustic data for classification. The intent representations of the text domain and the Mel-spectrogram domain are extracted via intent encoders, and then the reconstructed text feature and the Mel-spectrogram feature are obtained through two exchanges. After combining the intent from two domains into a joint representation, the integrated intent representation is fed into a decision layer for classification. Experimental results show that our model obtains an average accuracy rate of 95% in detecting 25 different medical symptoms.
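The data flow described above (intent and content encoders per domain, two exchanges to obtain reconstructed features, then fusion and a decision layer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the random linear maps stand in for trained networks, and all dimensions, function names, and the choice of concatenation for fusion are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; only NUM_SYMPTOMS = 25 comes from the text.
TEXT_DIM, MEL_DIM, INTENT_DIM, CONTENT_DIM, NUM_SYMPTOMS = 32, 40, 16, 8, 25


def linear(in_dim, out_dim):
    """Return a random linear map standing in for a trained network."""
    w = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: x @ w


# Per-domain intent and content encoders, plus one decoder per domain.
intent_enc_text = linear(TEXT_DIM, INTENT_DIM)
intent_enc_mel = linear(MEL_DIM, INTENT_DIM)
content_enc_text = linear(TEXT_DIM, CONTENT_DIM)
content_enc_mel = linear(MEL_DIM, CONTENT_DIM)
dec_text = linear(INTENT_DIM + CONTENT_DIM, TEXT_DIM)
dec_mel = linear(INTENT_DIM + CONTENT_DIM, MEL_DIM)

# Fusion and decision layers (concatenation-based fusion is an assumption).
fusion = linear(2 * INTENT_DIM, INTENT_DIM)
classifier = linear(INTENT_DIM, NUM_SYMPTOMS)


def exchange(x_text, x_mel):
    """One exchange: decode each domain's content with the other domain's intent."""
    zi_text, zi_mel = intent_enc_text(x_text), intent_enc_mel(x_mel)
    zc_text, zc_mel = content_enc_text(x_text), content_enc_mel(x_mel)
    text_out = dec_text(np.concatenate([zi_mel, zc_text], axis=-1))
    mel_out = dec_mel(np.concatenate([zi_text, zc_mel], axis=-1))
    return text_out, mel_out


def classify(x_text, x_mel):
    """Fuse the two intent representations and pick the highest-scoring class."""
    joint = np.concatenate([intent_enc_text(x_text), intent_enc_mel(x_mel)], axis=-1)
    logits = classifier(fusion(joint))
    return logits.argmax(axis=-1)


# A batch of 4 paired utterances with made-up features.
x_text = rng.normal(size=(4, TEXT_DIM))
x_mel = rng.normal(size=(4, MEL_DIM))

text_swapped, mel_swapped = exchange(x_text, x_mel)      # first exchange
text_rec, mel_rec = exchange(text_swapped, mel_swapped)  # second exchange
pred = classify(x_text, x_mel)
```

In this sketch, the second exchange swaps the intents back, so `text_rec` and `mel_rec` play the role of the reconstructed features that a reconstruction loss would compare against `x_text` and `x_mel` during training. Classification uses only the intent encoders, the fusion layer, and the classifier.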

Paper Structure (22 sections, 11 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Architecture and basic training objectives of DRSC. The model disentangles intent representations from the text and Mel-spectrogram domains. Intent information extracted from the two domains is then fused for classification.
  • Figure 2: In the inference stage, we only need to use the trained intent encoder to extract intent representations from two domains for classification.
  • Figure 3: Statistics of symptom types in the dataset.
  • Figure 4: Confusion matrices of the classification results for the baseline method and the proposed method. SpeechIC uses both Mel-spectrogram and text as input, obtaining better results than using Mel-spectrogram or text only.