AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro

TL;DR

AKVSR tackles Visual Speech Recognition (VSR) by compensating for the limited information in lip movements with linguistic audio knowledge from a large-scale pretrained audio model. It builds a compact audio memory by vector-quantizing audio features so that only linguistically relevant representations are stored, then uses an Audio Bridging Module (ABM) with cross-attention to retrieve this knowledge and inject it into the visual stream, so no audio input is required at inference. The framework is trained with a hybrid loss $L_{tot}=(1-\lambda)L_{att}+\lambda L_{ctc}$ and reports state-of-the-art results on LRS3 across data regimes, while ablations support the contribution of the memory, the ABM, and the choice of pretrained audio model. The approach offers a modular way to enhance VSR with linguistic audio knowledge: it adds modest parameter overhead and is compatible with various VSR architectures.
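
The hybrid loss above is a convex combination of the attention loss and the CTC loss. As a minimal sketch, assuming both losses have already been computed as scalars, the weighting can be written as follows; the function name `hybrid_loss` and the numeric values are illustrative assumptions, not taken from the paper.

```python
def hybrid_loss(l_att: float, l_ctc: float, lam: float) -> float:
    """Combine two scalar losses as (1 - lam) * l_att + lam * l_ctc."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * l_att + lam * l_ctc


# Illustrative values only (assumption): the paper's actual lambda is not given here.
total = hybrid_loss(l_att=2.0, l_ctc=4.0, lam=0.1)
```

With `lam = 0` the objective reduces to the attention loss alone, and with `lam = 1` to the CTC loss alone.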

Abstract

Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because lip movements carry insufficient speech information. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) that complements the insufficient speech information of the visual modality with the audio modality. Unlike previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of this audio knowledge in a compact audio memory by discarding the non-linguistic information through quantization, and 3) includes an Audio Bridging Module that finds the best-matched audio features in the compact audio memory, so that, once the compact audio memory is composed, training is possible without audio inputs. We validate the effectiveness of the proposed method through extensive experiments and achieve new state-of-the-art performance on the widely used LRS3 dataset.
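
The quantization step in point 2) can be pictured as replacing each frame-level audio feature with the index of its nearest cluster centroid and then looking up a stored vector for that index. The sketch below assumes the centroids are already available from a clustering step; the names `quantize` and `lookup`, the toy dimensions, and all numeric values are hypothetical and are not the authors' implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Map each frame-level feature to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in features
    ]


def lookup(indices: Sequence[int], memory: Sequence[Vector]) -> List[List[float]]:
    """Fetch the memory slot stored for each discrete index."""
    return [list(memory[i]) for i in indices]


# Toy data (assumption): 3 centroids in a 2-D feature space, 3 memory slots of size 3.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
memory = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
features = [[0.2, -0.1], [0.9, 1.2], [3.5, 0.3], [0.8, 0.7]]

indices = quantize(features, centroids)   # one discrete index per audio frame
discrete = lookup(indices, memory)        # one memory slot per audio frame
```

Frames that fall in the same cluster share one memory slot, which is the sense in which fine-grained variation within a cluster is discarded.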

Paper Structure (25 sections, 9 equations, 2 figures, 9 tables)

Figures (2)

  • Figure 1: Overview of building the compact audio memory, which stores the linguistic information of a large-scale pretrained audio model. Audio features are generated by the pretrained audio model and transformed into discrete representations in the compact audio memory. Sentence prediction is performed so that linguistic information is stored in the memory. The trained compact audio memory is then used in the proposed AKVSR.
  • Figure 2: The overall framework of the proposed AKVSR, which complements the visual modality with the audio modality. AKVSR consists of two main parts: 1) The compact audio memory provides linguistic information from audio knowledge generated by a large-scale pretrained audio model. N denotes the number of discrete representations in the memory, which equals the number of cluster groups of audio features. 2) The proposed Audio Bridging Module (ABM) finds the best-matched information in the compact audio memory and injects this linguistic information into the visual features to compensate for the insufficient information in lip movements.
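
The retrieval described in the Figure 2 caption can be illustrated with single-head scaled dot-product cross-attention, where each visual feature acts as a query over the N memory slots. This is a simplified sketch under stated assumptions: it omits learned projections, uses the memory slots as both keys and values, and injects the retrieved vector by residual addition. The function names `retrieve` and `bridge` and the toy values are hypothetical, not the paper's implementation.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query: Vector, memory: Sequence[Vector]) -> Tuple[List[float], List[float]]:
    """Attend over memory slots with one visual query; return weights and read-out."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, slot)) / math.sqrt(d) for slot in memory]
    weights = softmax(scores)
    retrieved = [sum(w * slot[j] for w, slot in zip(weights, memory)) for j in range(d)]
    return weights, retrieved


def bridge(visual_frames: Sequence[Vector], memory: Sequence[Vector]) -> List[List[float]]:
    """Add the retrieved memory read-out to every visual frame (assumed residual)."""
    enhanced = []
    for frame in visual_frames:
        _, retrieved = retrieve(frame, memory)
        enhanced.append([v + r for v, r in zip(frame, retrieved)])
    return enhanced


# Toy data (assumption): N = 2 memory slots and two visual frames, all 2-D.
memory = [[10.0, 0.0], [0.0, 10.0]]
visual_frames = [[1.0, 0.0], [0.0, 1.0]]

weights, retrieved = retrieve(visual_frames[0], memory)
enhanced = bridge(visual_frames, memory)
```

Because the queries come only from the visual stream and the memory is fixed after it is built, this read-out needs no audio input, which matches the inference setting described above.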