Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

Yucong Zhang; Xin Zou; Jinshan Yang; Wenjun Chen; Juan Liu; Faya Liang; Ming Li

TL;DR

This work introduces MLVAS, a multimodal system for assisted diagnosis of Vocal Fold Paralysis that processes raw laryngoscopic video by jointly leveraging audio (via a pre-trained Dasheng encoder) and visual features (via a two-stage glottis segmentation and LVFDyn/RVFDyn metrics). A front-end identifies complete phonation-cycle segments using audio keyword spotting and HSV-based strobing analysis, while a back-end fuses audio and visual signals through a ConvLSTM to detect VFP and UVFP, distinguishing left from right paralysis. Key innovations include a diffusion-model refinement to reduce glottis segmentation false positives and a quadratic midline fitting to extract dynamic vocal-fold angles, enabling reliable, objective UVFP diagnosis. Experimental results on BAGLS and SYSU datasets demonstrate improved VFP detection, robustness under data scarcity, and interpretable visualizations (GAW vs. VFDyn) that support clinical decision-making and potential workflow integration.

Abstract

This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with audio keyword spotting to identify patient vocalizations and refine video highlights, so that vocal fold movements can be inspected under optimal conditions. Beyond extracting key video segments from the raw laryngeal videos, MLVAS generates audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders encode the patient's voice to obtain the audio features. Visual features are obtained by measuring, on the segmented glottis masks, the angle deviation of the left and right vocal folds from the estimated glottal midline. To obtain better masks, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conduct several ablation studies to demonstrate the effectiveness of each module and modality in the proposed MLVAS. Experimental results on a public segmentation dataset show the effectiveness of the proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinical dataset demonstrate MLVAS's ability to provide reliable and objective metrics, as well as visualizations, for assisted clinical diagnosis.
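As a rough illustration of the angle-deviation idea in the abstract, the sketch below computes the angle between an estimated midline and a point on each vocal fold edge. It is not the authors' implementation: the function names, the choice of the refined bottom point (called $D^\prime$ in the Figure 5 caption) as the angle vertex, and the use of a single edge point per fold are all assumptions made for illustration.

```python
import math

def angle_between(origin, a, b):
    """Unsigned angle in degrees at `origin` between rays origin->a and origin->b.

    Points are (x, y) tuples in image coordinates.
    """
    ax, ay = a[0] - origin[0], a[1] - origin[1]
    bx, by = b[0] - origin[0], b[1] - origin[1]
    dot = ax * bx + ay * by
    cross = ax * by - ay * bx
    # atan2 of |cross| and dot gives an angle in [0, 180] degrees.
    return math.degrees(math.atan2(abs(cross), dot))

def fold_angle_deviations(c, d_refined, left_edge_pt, right_edge_pt):
    """Hypothetical helper: angle of each fold edge point from the midline.

    Assumes the midline runs from d_refined (bottom point) to c (center point),
    and that one representative edge point per fold is already available.
    """
    left = angle_between(d_refined, c, left_edge_pt)
    right = angle_between(d_refined, c, right_edge_pt)
    return left, right

# Illustrative, made-up coordinates: a symmetric glottis opening.
left_deg, right_deg = fold_angle_deviations(
    c=(50, 50), d_refined=(50, 100),
    left_edge_pt=(40, 50), right_edge_pt=(60, 50),
)
```

Under these assumptions, a healthy symmetric opening would give similar left and right angles over time, while a fold whose angle stays nearly constant across a phonation cycle would be the natural one to flag for review.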

Paper Structure (27 sections, 5 equations, 12 figures, 6 tables, 1 algorithm)

Introduction
Multimodal Extraction of Key Video Segments
System Design
Audio Processing Module
Video Processing Module
Audio Modeling with Pretrained Models
Visual Feature Extraction with Enhanced Glottis Image Segmentation
U-Net-based Method
Diffusion Model-based Refinement
Vocal Fold Dynamics (VFDyn) Extraction with Quadratic Fitting
Multimodal Vocal Fold Paralysis Analysis
Experimental Settings
BAGLS Dataset
SYSU Dataset
Keyword Spotting Model
...and 12 more sections

Figures (12)

Figure 1: The overview of our proposed MLVAS framework.
Figure 2: The overview of the audio processing module. The orange line shows the training process, and the purple line shows the inference process.
Figure 3: An example result of the HSV analysis for strobing video extraction. The blue line represents the HSV values along the time axis. The yellow line is the unit step function highlighting the empty frames with zero values.
Figure 4: The overview of the second-pass diffusion-based refinement. The denoising process starts from a customized Gaussian noise, utilizing the glottis mask generated by the first-pass U-Net glottis segmentation.
Figure 5: The workflow of computing the angle deviation of the left and right vocal folds. Step (a) obtains the center point $C$ and bottom point $D$. Steps (b), (c), and (d) fit the outline of the vocal folds with a quadratic function, refining the bottom point to $D^\prime$. Step (e) shows the resulting refined midline $CD^\prime$, from which the angle deviation of each vocal fold can be computed.
...and 7 more figures
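
The Figure 3 caption describes a per-frame HSV value and a unit step function that is zero on empty frames. A minimal sketch of how such a step function could be turned into candidate video segments is given below. It assumes a per-frame scalar has already been computed from the HSV representation; the threshold, the minimum segment length, and the function names are hypothetical and not taken from the paper.

```python
def step_mask(values, threshold=0.0):
    """Unit step over per-frame values: 0 for empty frames, 1 otherwise.

    `threshold` is an assumed parameter; frames at or below it count as empty.
    """
    return [1 if v > threshold else 0 for v in values]

def contiguous_segments(mask, min_len=1):
    """Return (start, end) frame index pairs, end exclusive, for runs of 1s.

    Runs shorter than `min_len` frames are dropped (an assumed filter).
    """
    segments = []
    start = None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i                      # a run of non-empty frames begins
        elif not m and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None                   # the run ended at an empty frame
    if start is not None and len(mask) - start >= min_len:
        segments.append((start, len(mask)))
    return segments

# Made-up per-frame values; zeros stand for empty frames.
frame_values = [0, 0, 5, 6, 7, 0, 0, 3, 0, 4, 4, 4]
segments = contiguous_segments(step_mask(frame_values), min_len=2)
```

With these made-up values, the isolated single non-empty frame at index 7 is discarded and the two longer runs are kept as candidate segments.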
