Testing MediaPipe Holistic for Linguistic Analysis of Nonmanual Markers in Sign Languages

Anna Kuznetsova, Vadim Kimmelman

TL;DR

This work assesses whether MediaPipe Holistic (MPH) can support linguistic analysis of nonmanual markers in sign languages by comparing it to OpenFace (OF) on Kazakh-Russian Sign Language (KRSL) data and a targeted head-tilt/eyebrow data set. The authors analyze eyebrow-position signals, including distortions caused by head pitch, and find that MPH introduces complex, direction- and distance-dependent distortions that obscure genuine eyebrow patterns. OF shows different systematic distortions, which can be corrected with a previously proposed correction model. The study demonstrates that MPH, in its current form, cannot be used directly for reliable linguistic analysis without substantial corrective modeling, and it highlights the need for robust, generalized distortion-correction pipelines for computer-vision landmarks in sign-language research. These findings caution researchers against relying on MPH for nonmanual marker analyses and motivate the development of tailored correction methods to enable scalable, automated linguistic research on sign languages.

Abstract

Advances in Deep Learning have made possible reliable landmark tracking of human bodies and faces that can be used for a variety of tasks. We test a recent Computer Vision solution, MediaPipe Holistic (MPH), to find out whether its tracking of facial features is reliable enough for linguistic analysis of sign language data, and compare it to an older solution (OpenFace, OF). We use an existing data set of sentences in Kazakh-Russian Sign Language and a newly created small data set of videos with head tilts and eyebrow movements. We find that MPH does not perform well enough for linguistic analysis of eyebrow movement, but in a different way from OF, which also performs poorly without correction. We reiterate a previous proposal to train additional correction models to overcome these limitations.

Paper Structure (12 sections, 7 figures, 2 tables)

This paper contains 12 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Head models based on OF outputs demonstrating distortion due to head pitch.
  • Figure 2: Eyebrow marking for different types of sentences in the KRSL data set. Top panel: MPH. Middle panel: OF corrected. Lower panel: OF non-corrected. Colors represent sentence types (orange: polar question, green: content question, purple: statement). N: noun, V: verb. Left column: inner eyebrow distance, right column: outer eyebrow distance. X-axis is frame number normalized to 70.
  • Figure 3: Head models based on MPH output, for close pitch up and close pitch down videos. Grey line: no pitch; red line: middle of pitch motion; yellow line: maximal pitch.
  • Figure 4: MPH eyebrow distance estimation for inner eyebrows in the close condition; head pitch up or down; with or without raised eyebrows. The distance and frame numbers are standardized.
  • Figure 5: MPH eyebrow distance (colored dots) and head pitch (black lines) estimation for inner eyebrows in the close condition; head pitch up with or without raised eyebrows. The distances and frame numbers are standardized.
  • ...and 2 more figures
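
The captions above mention frame numbers normalized to 70 and standardized distances, but this summary does not say how either step was carried out. The following is a minimal sketch under two assumptions that are not taken from the paper: time normalization by linear interpolation, and standardization by z-scoring. The function names `resample` and `standardize` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def resample(series, n_points=70):
    """Linearly interpolate a per-frame series onto n_points evenly spaced positions.

    Assumption: the time normalization in the figures is approximated here
    by plain linear interpolation; the original method may differ.
    """
    if not series:
        raise ValueError("series must not be empty")
    if len(series) == 1 or n_points == 1:
        return [float(series[0])] * n_points
    last = len(series) - 1
    out = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)   # fractional index into the original series
        lo = int(pos)                     # frame at or before pos
        hi = min(lo + 1, last)            # next frame, clamped at the end
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def standardize(series):
    """Z-score a series: subtract the mean, divide by the population standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)        # constant series: no variation to scale
    return [(x - mu) / sigma for x in series]


# Hypothetical eyebrow-distance values for one short video (arbitrary units).
eyebrow_distance = [0.41, 0.42, 0.47, 0.55, 0.56, 0.50, 0.43]
normalized = standardize(resample(eyebrow_distance, n_points=70))
```

After these two steps, videos of different lengths share a common 70-point time axis and a common scale, which is what makes curves from different sentences comparable in a single plot.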