An Attentive Dual-Encoder Framework Leveraging Multimodal Visual and Semantic Information for Automatic OSAHS Diagnosis

Yingchen Wei; Xihe Qiu; Xiaoyu Tan; Jingjing Huang; Wei Chu; Yinghui Xu; Yuan Qi

TL;DR

This paper addresses the high cost and complexity of diagnosing obstructive sleep apnea-hypopnea syndrome (OSAHS) by proposing VTA-OSAHS, a multimodal dual-encoder framework that fuses facial visual features with semantic physiological data. The architecture pairs an image encoder (Attention Mesh with stochastic gates) with a text encoder (Clinical BERT) and uses a cross-attention fusion module to capture inter-modal relationships. It balances the classes with RandomOverSampler and trains with an ordinal regression loss so that predictions respect the ordering of severity levels. On a clinical dataset of 500 patients, VTA-OSAHS achieves 91.3% top-1 accuracy and 95.6% AUC across four severity levels, outperforming state-of-the-art image-only and text-only baselines as well as various multimodal models. The approach demonstrates strong diagnostic performance and practical potential to reduce dependence on polysomnography (PSG); future work focuses on optimization and on expanding the clinical dataset.
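To make the class-balancing step concrete, the following is a minimal sketch of random oversampling in plain Python. It is not the authors' code and does not call any library's RandomOverSampler; the function name `random_oversample`, the toy patient identifiers, and the label distribution are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class (illustrative sketch)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples that belong to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement to fill the gap to the majority count.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Hypothetical toy data: four severity levels encoded as 0..3.
samples = [f"patient_{i}" for i in range(10)]
labels = [0, 0, 0, 0, 0, 1, 1, 2, 2, 3]

balanced_samples, balanced_labels = random_oversample(samples, labels)
class_counts = Counter(balanced_labels)
```

After the call, every class in this toy example has as many entries as the largest original class. Oversampling of this kind is normally applied to the training split only, so that duplicated patients do not leak into evaluation data.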

Abstract

Obstructive sleep apnea-hypopnea syndrome (OSAHS) is a common sleep disorder caused by upper airway blockage, leading to oxygen deprivation and disrupted sleep. Traditional diagnosis using polysomnography (PSG) is expensive, time-consuming, and uncomfortable. Existing deep learning methods based on facial image analysis lack accuracy because they capture facial features poorly and rely on limited sample sizes. To address this, we propose a multimodal dual-encoder model that integrates visual and language inputs for automated OSAHS diagnosis. The model balances data using RandomOverSampler, extracts key facial features with attention grids, and converts physiological data into meaningful text. Cross-attention combines image and text data for better feature extraction, and an ordinal regression loss ensures stable learning. Our approach improves diagnostic efficiency and accuracy, achieving 91.3% top-1 accuracy in a four-class severity classification task, demonstrating state-of-the-art performance. Code will be released upon acceptance.
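The abstract does not give the exact form of the loss for ordered severity levels. As an illustration only, the sketch below uses one common formulation, which is an assumption here and not necessarily the one used in the paper: a four-class problem is recast as three binary questions ("is severity greater than level k?") and the binary cross-entropies are summed. The function names and the example logits are hypothetical.

```python
import math


def ordinal_targets(label, num_classes):
    """Encode an ordered label as K-1 binary targets.
    Example: label 2 with 4 classes -> [1, 1, 0]."""
    return [1.0 if label > k else 0.0 for k in range(num_classes - 1)]


def ordinal_loss(logits, label):
    """Sum of binary cross-entropies over the K-1 threshold questions.
    `logits` holds one raw score per threshold (assumed formulation)."""
    targets = ordinal_targets(label, len(logits) + 1)
    loss = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable BCE with logits:
        # max(z, 0) - z * t + log(1 + exp(-|z|))
        loss += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return loss


def predict_level(logits):
    """Predicted level = number of thresholds passed with probability > 0.5."""
    return sum(1 for z in logits if z > 0.0)


# Hypothetical model output that is confident the severity is level 2.
logits = [4.0, 3.0, -3.0]
predicted = predict_level(logits)
loss_true_2 = ordinal_loss(logits, 2)
loss_true_3 = ordinal_loss(logits, 3)
loss_true_0 = ordinal_loss(logits, 0)
```

With these logits, the loss is smallest when the true label matches the prediction and grows as the true label moves further from it. That distance sensitivity is the property that separates an ordinal loss from plain cross-entropy, which treats every wrong class as equally wrong.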
