Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective

Chen Chen, Xiaolou Li, Zehua Liu, Lantian Li, Dong Wang

TL;DR

This paper presents a quantitative analysis of audio-visual tasks based on information theory, focusing on the information shared between modalities. The analysis helps explain why audio-visual processing tasks are difficult and what benefits modality integration can bring.

Abstract

In the field of spoken language processing, audio-visual speech processing is receiving increasing research attention. Key components of this research include tasks such as lip reading, audio-visual speech recognition, and visual-to-speech synthesis. Although significant success has been achieved, theoretical analysis is still insufficient for audio-visual tasks. This paper presents a quantitative analysis based on information theory, focusing on information intersection between different modalities. Our results show that this analysis is valuable for understanding the difficulties of audio-visual processing tasks as well as the benefits that could be obtained by modality integration.
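
The "information intersection" between two modalities is commonly measured by mutual information, I(A;V) = H(A) + H(V) - H(A,V). The abstract does not say which estimator the paper uses, so the following is only a minimal sketch under assumptions: A (audio) and V (visual) are treated as discrete variables, and the joint table `joint` and the function names are hypothetical, chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(A;V) = H(A) + H(V) - H(A,V) for a joint table joint[a][v]."""
    p_a = [sum(row) for row in joint]            # marginal over V
    p_v = [sum(col) for col in zip(*joint)]      # marginal over A
    p_av = [p for row in joint for p in row]     # flattened joint
    return entropy(p_a) + entropy(p_v) - entropy(p_av)


# Hypothetical toy joint distribution (not data from the paper):
# audio and visual symbols agree 80% of the time.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

mi = mutual_information(joint)
print(f"I(A;V) = {mi:.4f} bits")
```

In this toy setting each modality carries 1 bit on its own, and only a fraction of that bit is shared. A larger shared fraction would suggest that one modality can stand in for the other more easily, while a smaller one would suggest more to gain from combining them.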

Paper Structure

This paper contains 15 sections, 6 equations, 1 figure, 4 tables, and 1 algorithm.

Figures (1)

  • Figure 1: Information diagram computed based on CNVSRC-Multi, using deep features. Note that only the information in the black box is related to the purpose of conversion/speech. The auditory and visual signals partly represent the purpose but also involve some subtle information that is not clearly shown.