SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization

Young Jin Ahn, Jungwoo Park, Sangha Park, Jonghyun Choi, Kee-Eung Kim

TL;DR

Visual Speech Recognition (VSR) is hindered by homophenes and limited visual cues. SyncVSR introduces end-to-end crossmodal supervision: a non-autoregressive encoder predicts discrete audio tokens from video frames, guided by an audio reconstruction loss. The total objective combines the standard VSR losses with this audio term, enabling frame-level synchronization via quantized audio tokens and improving discrimination of fine-grained phonetic differences. The approach yields state-of-the-art results on word-level English and Chinese benchmarks, strong sentence-level performance with up to $9\times$ less data, and ablations that highlight the benefit of full-sequence synchronization over masked reconstruction. This work advances robust, data-efficient multimodal VSR and broadens applicability across languages and modalities.

Abstract

Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes, visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
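
The frame-level supervision described above can be illustrated with a minimal sketch. It assumes the audio term is a cross-entropy between the encoder's per-frame prediction $q(z_t|x)$ and the quantized audio token $z_t$, and that the total objective has the form $L_{\text{VSR}} + \lambda L_{\text{audio}}$. Both are assumptions made for illustration, not the authors' stated formulation. The function names are hypothetical, and the sketch uses one audio token per video frame for simplicity.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def audio_token_loss(frame_logits, audio_tokens):
    """Mean cross-entropy between q(z_t | x) and the target tokens z_t.

    frame_logits: one list of vocabulary logits per frame, all produced in a
                  single forward pass (non-autoregressive).
    audio_tokens: one integer token id per frame (assumed 1:1 alignment).
    """
    if len(frame_logits) != len(audio_tokens):
        raise ValueError("need exactly one target token per frame")
    total = 0.0
    for logits, target in zip(frame_logits, audio_tokens):
        total -= log_softmax(logits)[target]
    return total / len(audio_tokens)


def total_loss(vsr_loss, frame_logits, audio_tokens, lam=1.0):
    """Assumed combined objective: VSR loss plus lam times the audio term."""
    return vsr_loss + lam * audio_token_loss(frame_logits, audio_tokens)


# Two frames, a three-token audio vocabulary, and a placeholder VSR loss of 0.5.
example = total_loss(0.5, [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [0, 1], lam=10.0)
```

Because every frame's logits come from the same forward pass, the audio term adds no decoding loop at training time, which is consistent with the abstract's "at the cost of a forward pass".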

Paper Structure

This paper contains 6 sections, 6 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Performance of SyncVSR on the LRS3 (Afouras et al., 2018) benchmark. SyncVSR outperforms available methods given a similar amount of video data. Our method also advances a tier in model size: our base-size model outperforms other large-size models.
  • Figure 2: Overview of the SyncVSR training framework. Given a sequence of video frames, the encoder generates a corresponding sequence of quantized audio tokens in a non-autoregressive manner. $z_t$ denotes audio tokens, and $q(z_t|x)$ is the encoder's prediction through a linear projection layer.
  • Figure 3: The edit distance of word pairs and the model's discriminative ability. Homophene pairs closely resemble each other in graphemes, a scenario where SyncVSR classifies better than the vanilla setting trained without audio information. Non-autoregressive generation with a strong audio reconstruction loss weight ($\lambda$) is optimal, whereas masked reconstruction can cause harm in certain instances.
  • Figure 4: Influence of the audio reconstruction loss weight ($\lambda$) on the encoder's representation, visualized with the mean attention distance (Dosovitskiy et al., 2021) distribution. Each point indicates the weighted distance of attention from the query frame to other frames.
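
The two quantities named in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative sketch under assumptions. The edit distance is taken to be the standard Levenshtein distance over graphemes. The attention distance is taken to be, for each query frame, the attention-weighted absolute offset in frame index. The paper's exact definitions may differ, and both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def mean_attention_distance(attn):
    """Per-query weighted distance: sum over j of attn[i][j] * |i - j|.

    attn[i][j] is the attention weight from query frame i to key frame j;
    each row is assumed to sum to 1.
    """
    return [
        sum(weight * abs(i - j) for j, weight in enumerate(row))
        for i, row in enumerate(attn)
    ]


# "pat" and "bat" look alike on the lips and differ by a single grapheme.
pair_distance = edit_distance("pat", "bat")

# Uniform attention over three frames spreads weight across all offsets.
uniform = [[1 / 3] * 3 for _ in range(3)]
distances = mean_attention_distance(uniform)
```

A small edit distance marks word pairs that are close in spelling, the regime Figure 3 associates with homophenes. A small mean attention distance marks a query frame that attends mostly to its temporal neighbors.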