JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition

Chang Sun, Hong Yang, Bo Qin

TL;DR

Visual speech recognition (VSR) faces a performance ceiling because the visual modality carries less semantic information than audio. The paper presents JEP-KD, a JEPA-based knowledge distillation framework that inserts a generator into the embedding space to translate video semantics toward audio semantics, guided by a discriminator and trained with a three-stage multimodal protocol. The approach substantially reduces the character error rate (CER) on CMLR (e.g., from 19.92% to 14.26%, and to 11.97% with large-scale pretraining), which supports both the idea of aligning audio and video embeddings and the robustness of the training scheme. Although a gap to semantic-recognition models remains, the method shows potential for use across lip-reading models and for extension to audio-visual speech recognition (AVSR) with teacher guidance.
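
To put the CER figures above in context, the following is a minimal sketch of how a character error rate and a relative error reduction can be computed. It assumes the common definition of CER as character-level edit distance divided by reference length; the function names `edit_distance`, `cer`, and `relative_reduction` are illustrative and do not come from the paper, and the paper's exact scoring script may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# CER values quoted in the TL;DR (percent).
print(round(100 * relative_reduction(19.92, 14.26), 1))  # 28.4
print(round(100 * relative_reduction(19.92, 11.97), 1))  # 39.9
```

Under this definition, the reported drop from 19.92% to 14.26% corresponds to roughly a 28% relative reduction in CER, and the drop to 11.97% to roughly 40%.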

Abstract

Visual Speech Recognition (VSR) tasks are generally recognized to have a lower theoretical performance ceiling than Automatic Speech Recognition (ASR), owing to the inherent limitations of conveying semantic information visually. To mitigate this challenge, this paper introduces an advanced knowledge distillation approach using a Joint-Embedding Predictive Architecture (JEPA), named JEP-KD, designed to utilize audio features more effectively during model training. Central to JEP-KD is the inclusion of a generative network within the embedding layer, which enhances the video encoder's capacity for semantic feature extraction and brings it into closer alignment with the audio features from a pre-trained ASR model's encoder. This approach aims to progressively reduce the performance gap between VSR and ASR. Moreover, a comprehensive multimodal, multistage training regimen for the JEP-KD framework is established, bolstering the robustness and efficacy of the training process. Experimental results show that JEP-KD significantly improves the performance of VSR models and is versatile across different VSR platforms, indicating its potential for broader application to other multimodal tasks.
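
The core idea of pulling generated video embeddings toward a frozen ASR encoder's audio embeddings can be illustrated with a toy example. This is only a sketch under strong assumptions: the generator here is a single linear map (a stand-in for the paper's generative network), the discriminator and the CTC and CE losses are omitted, and all names (`generate`, `l1_loss`, `l1_step`) and numeric values are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def generate(video_emb: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Toy linear 'generator': maps a video embedding into the audio space."""
    return [sum(w * x for w, x in zip(row, video_emb)) + b
            for row, b in zip(weights, bias)]


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute difference between predicted and teacher embeddings."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def l1_step(video_emb: Vector, audio_emb: Vector, weights: Matrix,
            bias: Vector, lr: float) -> Tuple[Matrix, Vector]:
    """One subgradient step on the L1 loss; the teacher target is not updated."""
    pred = generate(video_emb, weights, bias)
    n = len(audio_emb)
    grad = [sign(p - t) / n for p, t in zip(pred, audio_emb)]
    new_weights = [[w - lr * g * x for w, x in zip(row, video_emb)]
                   for row, g in zip(weights, grad)]
    new_bias = [b - lr * g for b, g in zip(bias, grad)]
    return new_weights, new_bias


# Hypothetical 2-dimensional embeddings for a single frame.
video_emb = [1.0, 2.0]
audio_emb = [1.0, -1.0]          # output of the frozen teacher (ASR encoder)
weights = [[0.0, 0.0], [0.0, 0.0]]
bias = [0.0, 0.0]

loss_before = l1_loss(generate(video_emb, weights, bias), audio_emb)
weights, bias = l1_step(video_emb, audio_emb, weights, bias, lr=0.1)
loss_after = l1_loss(generate(video_emb, weights, bias), audio_emb)
print(round(loss_before, 3), round(loss_after, 3))  # 1.0 0.7
```

After one update the generated embedding moves closer to the teacher's audio embedding, which is the alignment effect the abstract describes; in the actual framework this signal is combined with the other training objectives.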

Paper Structure

This paper contains 10 sections, 5 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Comparison of the current knowledge distillation structure with the JEP-KD structure. (a) The knowledge distillation structure commonly used in current VSR tasks. (b) The JEP-KD structure proposed in this paper, which primarily differs from (a) in that JEP-KD introduces a prediction architecture at the embedding layer.
  • Figure 2: Visualization of the VSR model after the introduction of the JEP-KD structure. (a), (b), (c), and (d) are the structural diagrams of the video encoder, generator, discriminator, and decoder, respectively. During training, three loss functions are used as constraints: L1, CTC (connectionist temporal classification), and CE (cross-entropy).