Deep Learning for Visual Speech Analysis: A Survey

Changchong Sheng; Gangyao Kuang; Liang Bai; Chenping Hou; Yulan Guo; Xin Xu; Matti Pietikäinen; Li Liu

TL;DR

This survey presents a comprehensive review of recent progress in deep learning methods for visual speech analysis, covering fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance.

Abstract

Visual speech, referring to the visual domain of speech, has attracted increasing attention due to its wide applications, such as public security, medical treatment, military defense, and film entertainment. As a powerful AI strategy, deep learning has greatly advanced visual speech learning. Over the past five years, numerous deep learning-based methods have been proposed to address various problems in this area, especially automatic visual speech recognition and generation. To push forward future research on visual speech, this paper presents a comprehensive review of recent progress in deep learning methods for visual speech analysis. We cover different aspects of visual speech, including fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance. We also identify gaps in current research and discuss promising future research directions.

Paper Structure (38 sections, 7 figures, 4 tables)

Introduction
The Scope of this Survey
Differences with Related Surveys
Background
The Problems
Main Challenges
Recognition-related Challenges
Generation-Related Challenges
Data-Related Challenges
Datasets and Evaluation Metrics
Datasets
Datasets under controlled environments
Datasets under uncontrolled environments
Evaluation Metrics
Evaluation Metrics on VSR
...and 23 more sections

Figures (7)

Figure 1: Chronological milestones in visual speech analysis (VSA) from 2016 to the present, including representative VSR and VSG methods and audio-visual datasets. Handcrafted feature engineering methods dominated VSA until 2016, when the introduction of related deep networks marked a transition.
Figure 2: A taxonomy of representative visual speech recognition and generation methods.
Figure 3: The two formal-dual fundamental problems of visual speech analysis. Top part: Visual speech recognition or lip reading; Bottom part: Visual speech generation or lip sequence generation.
Figure 4: Main challenges of visual speech analysis. (a) A taxonomy of the main challenges. (b) Practical examples of different challenges. (b1) The upper and lower rows show the different visual dynamics of the word "wind" in different contexts; (b2) the upper video shows the word "place" and the lower video the word "please", yet their visual dynamics are very similar; (b3) two people speak the word "after", with a noticeable difference in their lip motions; (b4) an example of real-time changes in a speaker's head pose while talking.
Figure 5: Example images from AVICAR, OuluVS2, FaceForensics++, GRID, LRW, LRS2-BBC, VoxCeleb1, VoxCeleb2, MODALITY, ObamaSet, LRS3-TED, LRW-1000, VOCASET, HDTF, MEAD, and MuAViC. See Table \ref{Tab:AV-Datasets} for a summary of these datasets.
...and 2 more figures
