An Attentive Dual-Encoder Framework Leveraging Multimodal Visual and Semantic Information for Automatic OSAHS Diagnosis

Yingchen Wei, Xihe Qiu, Xiaoyu Tan, Jingjing Huang, Wei Chu, Yinghui Xu, Yuan Qi

TL;DR

This paper addresses the high cost and complexity of diagnosing obstructive sleep apnea-hypopnea syndrome (OSAHS) by proposing VTA-OSAHS, a multimodal dual-encoder framework that fuses facial visual features with semantic physiological data. The architecture pairs an image encoder (Attention Mesh with stochastic gates) with a text encoder (Clinical BERT) and joins them through a cross-attention fusion module that captures inter-modal relationships. It uses RandomOverSampler to balance the classes and an ordinal regression loss to respect the ordering of severity levels. On a clinical dataset of 500 patients, VTA-OSAHS achieves 91.3% top-1 accuracy and 95.6% AUC across four severity levels, outperforming state-of-the-art image-only and text-only baselines as well as several multimodal models. The results indicate strong diagnostic performance and practical potential to reduce dependence on polysomnography (PSG); future work will focus on optimization and on expanding the clinical dataset.
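To make the idea of an ordinal loss concrete, the following is a minimal sketch of one common formulation: a class label out of K severity levels is encoded as K-1 binary "is the severity greater than level j?" targets, and the loss is the sum of the per-threshold binary cross-entropies. This is an assumption chosen for illustration, not necessarily the loss used in the paper; the function names `ordinal_targets`, `ordinal_loss`, and `predict_level` are hypothetical.

```python
import math

def ordinal_targets(label, num_classes):
    """Encode a class label as K-1 binary answers to 'is severity > j?'."""
    return [1.0 if label > j else 0.0 for j in range(num_classes - 1)]

def ordinal_loss(logits, label):
    """Sum of binary cross-entropies over the K-1 thresholds (illustrative only)."""
    targets = ordinal_targets(label, len(logits) + 1)
    loss = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable binary cross-entropy computed from a raw logit.
        loss += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return loss

def predict_level(logits):
    """Predicted severity = number of thresholds the model believes are exceeded."""
    return sum(1 for z in logits if z > 0.0)

# Three threshold logits for four severity levels (0..3); values are made up.
logits = [3.0, 2.0, -2.0]
print(predict_level(logits))
print(ordinal_loss(logits, 2), ordinal_loss(logits, 3), ordinal_loss(logits, 0))
```

With this encoding, a prediction that misses the true level by one threshold is penalized less than one that misses by three, which is the property that distinguishes an ordinal loss from plain cross-entropy over unordered classes.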

Abstract

Obstructive sleep apnea-hypopnea syndrome (OSAHS) is a common sleep disorder caused by upper airway blockage, leading to oxygen deprivation and disrupted sleep. Traditional diagnosis using polysomnography (PSG) is expensive, time-consuming, and uncomfortable. Existing deep learning methods based on facial image analysis lack accuracy because they capture facial features poorly and rely on small sample sizes. To address this, we propose a multimodal dual-encoder model that integrates visual and language inputs for automated OSAHS diagnosis. The model balances the data using RandomOverSampler, extracts key facial features with attention grids, and converts physiological data into meaningful text. Cross-attention combines image and text data for better feature extraction, and an ordinal regression loss ensures stable learning. Our approach improves diagnostic efficiency and accuracy, achieving 91.3% top-1 accuracy in a four-class severity classification task, demonstrating state-of-the-art performance. Code will be released upon acceptance.
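The class-balancing step can be illustrated with a small, dependency-free sketch of random oversampling: samples from smaller classes are duplicated at random until every class is as large as the biggest one. This mirrors the general idea behind the RandomOverSampler utility named in the abstract, but it is not that library's code; the function name `random_oversample`, the toy labels, and the severity mapping in the comment are assumptions made for illustration.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies with replacement from the existing members of this class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Toy severity labels (hypothetical mapping): 0 normal, 1 mild, 2 moderate, 3 severe.
labels = [0, 0, 0, 0, 1, 1, 2, 3, 3, 3]
samples = [f"patient_{i}" for i in range(len(labels))]
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

In practice, oversampling of this kind is usually applied to the training split only, so that duplicated patients do not leak into validation or test data.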

Paper Structure

This paper contains 14 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: A potential clinical alternative has been proposed to reduce the use of PSG for diagnosing OSAHS. This method conserves clinical resources, reduces patient waiting times, and cuts down on diagnostic costs.
  • Figure 2: The overall framework of our proposed method. It consists of an image encoder, a text encoder, and a multi-modal fusion module. Example input sentence: "This 37-year-old male has a neck circumference of 42 cm, a waist-to-hip ratio of 0.9, and a body mass index of 32, indicating obesity, and no history of hypertension, diabetes, heart disease, or hyperlipidemia."
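
The example sentence in the Figure 2 caption suggests that structured physiological measurements are rendered into text with a template before being passed to the text encoder. A minimal sketch of such a template is given below. The record field names, the `describe_patient` helper, and the BMI cutoff of 30 for obesity are assumptions for illustration and are not taken from the paper.

```python
CONDITIONS = ("hypertension", "diabetes", "heart disease", "hyperlipidemia")

def join_words(items, conjunction):
    """Join words as 'a', 'a and b', or 'a, b, and c'."""
    items = list(items)
    if len(items) <= 2:
        return f" {conjunction} ".join(items)
    return ", ".join(items[:-1]) + f", {conjunction} " + items[-1]

def describe_patient(record, obesity_bmi=30.0):
    """Render a structured patient record as one descriptive sentence."""
    present = [c for c in CONDITIONS if record.get(c, False)]
    absent = [c for c in CONDITIONS if not record.get(c, False)]

    sentence = (
        f"This {record['age']}-year-old {record['sex']} has a neck circumference of "
        f"{record['neck_cm']} cm, a waist-to-hip ratio of {record['waist_hip_ratio']}, "
        f"and a body mass index of {record['bmi']}"
    )
    if record["bmi"] >= obesity_bmi:  # assumed cutoff, adjust to the clinical standard in use
        sentence += ", indicating obesity"

    history = []
    if present:
        history.append("a history of " + join_words(present, "and"))
    if absent:
        history.append("no history of " + join_words(absent, "or"))
    return sentence + "; the patient has " + " and ".join(history) + "."

# Hypothetical record matching the values quoted in the Figure 2 caption.
record = {"age": 37, "sex": "male", "neck_cm": 42, "waist_hip_ratio": 0.9, "bmi": 32}
sentence = describe_patient(record)
print(sentence)
```

Keeping the template deterministic means the same measurements always produce the same sentence, which makes the text branch easy to reproduce and debug.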