Exploring a Multimodal Fusion-based Deep Learning Network for Detecting Facial Palsy

Heng Yim Nicole Oo, Min Hun Lee, Jeong Hoon Lim

TL;DR

This study examines detecting facial palsy from video by comparing four data modalities derived from RGB frames and facial features, and by evaluating early and late multimodal fusion against modality-specific baselines. Using leave-one-patient-out cross-validation on the YouTube Facial Palsy (YFP) dataset, the authors show that line-segment images and expression features often outperform raw RGB data, and that multimodal fusion can yield small precision gains at the cost of recall. Their results highlight the value of data processing to generate structured representations and of fusion strategies to leverage complementary information, while also noting the need for improved explainability and attention-based architectures for clinical use.
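The leave-one-patient-out protocol mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `leave_one_patient_out` and the toy patient labels are assumptions made for the example. The key property is that every frame from the held-out patient goes to the test fold, so no patient appears in both training and test data.

```python
from collections import defaultdict


def leave_one_patient_out(patient_ids):
    """Yield (held_out_patient, train_indices, test_indices), one fold per patient.

    patient_ids[i] is the (hypothetical) patient label of sample i, so all
    frames from one patient always end up in the same fold.
    """
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    for held_out in sorted(by_patient):
        test_idx = by_patient[held_out]
        train_idx = sorted(
            i
            for pid, idxs in by_patient.items()
            if pid != held_out
            for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy example: six frames from three made-up patients.
frame_patients = ["p1", "p1", "p2", "p3", "p3", "p3"]
folds = list(leave_one_patient_out(frame_patients))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

With one fold per patient, the number of folds equals the number of patients, and the per-fold scores are typically averaged to obtain the reported metrics.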

Abstract

Algorithmic detection of facial palsy offers the potential to improve current practices, which usually involve labor-intensive and subjective assessment by clinicians. In this paper, we present a multimodal fusion-based deep learning model that utilizes unstructured data (i.e., an image frame with facial line segments) and structured data (i.e., features of facial expressions) to detect facial palsy. We then contribute a study that analyzes the effect of different data modalities and the benefits of a multimodal fusion-based approach using videos of 21 facial palsy patients. Our experimental results show that, among the data modalities considered (unstructured data: RGB images and images of facial line segments; structured data: coordinates of facial landmarks and features of facial expressions), the feed-forward neural network using features of facial expressions achieved the highest precision of 76.22, while the ResNet-based model using images of facial line segments achieved the highest recall of 83.47. When we leveraged both images of facial line segments and features of facial expressions, our multimodal fusion-based deep learning model slightly improved the precision score to 77.05 at the expense of a decrease in the recall score.
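Because the abstract reports a precision gain that comes with a recall loss, it may help to recall how the two metrics are computed. The sketch below uses made-up labels and predictions (not data from the paper) and a hypothetical helper `precision_recall`. It shows that a classifier that predicts the positive class sparingly tends to score higher on precision and lower on recall than one that predicts it liberally.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for binary labels given as equal-length lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up ground truth: 1 = facial palsy, 0 = no facial palsy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]

# A cautious classifier flags few samples; a liberal one flags many.
cautious = [1, 1, 0, 0, 0, 0, 0, 0]
liberal = [1, 1, 1, 1, 1, 1, 0, 0]

p_cautious, r_cautious = precision_recall(y_true, cautious)
p_liberal, r_liberal = precision_recall(y_true, liberal)
print(f"cautious: precision={p_cautious:.2f} recall={r_cautious:.2f}")
print(f"liberal:  precision={p_liberal:.2f} recall={r_liberal:.2f}")
```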

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, and 1 table.

Figures (2)

  • Figure 1: (a) A sample RGB image of a patient with facial palsy, (b) 125 three-dimensional coordinates of the eye, nose, and mouth regions overlaid on the RGB image, and (c) line segments of facial expressions on the black-and-white image
  • Figure 2: Our early fusion model integrates the facial expression-based embedding from a feed-forward neural network with the line segment-based embedding from a pre-trained ResNet50 model to detect a patient with facial palsy
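The idea behind the early fusion model in Figure 2 can be illustrated with a small, dependency-free sketch. It is an assumption-laden toy rather than the authors' architecture: the embedding sizes, the weights, and the single linear-plus-sigmoid head are invented for the example, and the short vectors merely stand in for the feed-forward and ResNet50 embeddings. Early fusion is shown here as concatenating the two embeddings before one shared classifier. For contrast, late fusion is shown as averaging per-modality probabilities, which is one common rule and not necessarily the one used in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def early_fusion(expr_emb, line_emb, weights, bias):
    """Concatenate both embeddings, then apply a single shared classifier head."""
    fused = list(expr_emb) + list(line_emb)
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused embedding length")
    logit = sum(w * v for w, v in zip(weights, fused)) + bias
    return sigmoid(logit)


def late_fusion(prob_expr, prob_line):
    """Average per-modality probabilities (one common late-fusion rule)."""
    return 0.5 * (prob_expr + prob_line)


# Hypothetical stand-ins for the two embeddings (sizes chosen arbitrarily).
expr_emb = [0.2, -0.1]       # facial-expression features after a feed-forward net
line_emb = [0.5, 0.3, -0.4]  # line-segment image after a ResNet50 backbone

# Made-up classifier parameters over the 5-dimensional fused vector.
weights = [1.0, -1.0, 0.5, 0.5, 1.0]
p_early = early_fusion(expr_emb, line_emb, weights, bias=0.0)

# Made-up per-modality probabilities for the late-fusion contrast.
p_late = late_fusion(0.8, 0.4)

print(f"early fusion probability: {p_early:.4f}")
print(f"late fusion probability:  {p_late:.4f}")
```

In a real system the shared head would be trained jointly on the concatenated embedding, which lets it learn interactions between modalities that a fixed averaging rule cannot capture.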