Deep Learning for Visual Speech Analysis: A Survey

Changchong Sheng, Gangyao Kuang, Liang Bai, Chenping Hou, Yulan Guo, Xin Xu, Matti Pietikäinen, Li Liu

TL;DR

This survey comprehensively reviews recent progress in deep learning methods for visual speech analysis, covering fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance.

Abstract

Visual speech, the visual domain of speech, has attracted increasing attention because of its wide range of applications, including public security, medical treatment, military defense, and film entertainment. Deep learning, a powerful AI approach, has substantially advanced the development of visual speech learning. Over the past five years, numerous deep-learning-based methods have been proposed to address various problems in this area, especially automatic visual speech recognition and generation. To push forward future research on visual speech, this paper presents a comprehensive review of recent progress in deep learning methods for visual speech analysis. We cover different aspects of visual speech, including fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance. We also identify gaps in current research and discuss promising future research directions.

Paper Structure (38 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Chronological milestones in visual speech analysis (VSA) from 2016 to the present, including representative visual speech recognition (VSR) and visual speech generation (VSG) methods, and audio-visual datasets. Handcrafted feature engineering methods dominated VSA until 2016, when the introduction of related deep networks marked a transition.
  • Figure 2: A taxonomy of representative visual speech recognition and generation methods.
  • Figure 3: The two formally dual fundamental problems of visual speech analysis. Top: visual speech recognition, or lip reading. Bottom: visual speech generation, or lip sequence generation.
  • Figure 4: Main challenges of visual speech analysis. (a) A taxonomy of the main challenges. (b) Practical examples of different challenges. (b1) The upper and lower rows show the different visual dynamics of the word "wind" in different contexts; (b2) the upper video shows the word "place" and the lower video shows the word "please", yet their visual dynamics are very similar; (b3) two people each speak the word "after", with a noticeable difference in their lip motions; (b4) an example of real-time changes in a speaker's head pose while talking.
  • Figure 5: Example images from AVICAR, OuluVS2, FaceForensics++, GRID, LRW, LRS2-BBC, VoxCeleb1, VoxCeleb2, MODALITY, ObamaSet, LRS3-TED, LRW-1000, VOCASET, HDTF, MEAD, and MuAViC. See the audio-visual datasets table (label Tab:AV-Datasets) for a summary of these datasets.
  • ...and 2 more figures