Enhancing Visual Forced Alignment with Local Context-Aware Feature Extraction and Multi-Task Learning

Yi He, Lei Yang, Shilin Wang

TL;DR

The paper tackles Visual Forced Alignment (VFA) without audio cues: synchronizing lip movements with text at word- and phoneme-level timing. It introduces the Cross Global-Local Conformer (CGL-Conformer) encoder, which fuses video and text via cross-attention and combines global long-range context with local windowed attention ($W=16$, kernel $k=3$), merging the two with a Max Feature Map. A multi-task learning framework jointly optimizes frame-level, boundary, and silence-aware text predictions with a total loss $L = L_F + L_B + L_S$, and a Viterbi post-processing step enforces sequence consistency using the sil_aware predictions and boundary cues. Experiments on LRS2 and LRS3 show state-of-the-art performance, with lower MAE and higher ACC, particularly at the phoneme level (e.g., MAE down by ~76% and ACC up by ~27% on LRS2). The results point to automatic subtitling of user-generated and broadcast content, and to accessibility applications, as practical uses.
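To make the global-local merge concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the Max Feature Map is assumed to be an element-wise maximum over the two branch outputs, and the local window is assumed to be a symmetric band of total width $W$ around each frame. The paper may define either detail differently.

```python
import numpy as np


def max_feature_map(global_feats, local_feats):
    """Merge two feature maps by element-wise maximum.

    Assumption: the summary's "Max Feature Map" merge is modeled here as a
    per-element max over the global-branch and local-branch outputs.
    """
    global_feats = np.asarray(global_feats, dtype=float)
    local_feats = np.asarray(local_feats, dtype=float)
    if global_feats.shape != local_feats.shape:
        raise ValueError("both branches must produce the same shape")
    return np.maximum(global_feats, local_feats)


def local_window_mask(num_frames, window=16):
    """Boolean (num_frames, num_frames) attention mask.

    Assumption: frame i may attend to frame j only when |i - j| <= window // 2.
    The exact window layout used in the paper is not given in this summary.
    """
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2


# Usage: merge two (frames, channels) feature maps and build a W=16 mask.
merged = max_feature_map(np.array([[1.0, 5.0]]), np.array([[3.0, 2.0]]))
mask = local_window_mask(40, window=16)
```

The element-wise maximum keeps, for each channel and frame, whichever branch responds more strongly, so neither the global nor the local view is averaged away.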

Abstract

This paper introduces a novel approach to Visual Forced Alignment (VFA), which aims to accurately synchronize utterances with the corresponding lip movements without relying on audio cues. The approach integrates a local context-aware feature extractor and employs multi-task learning to refine both global and local context features, enhancing sensitivity to subtle lip movements for precise word-level and phoneme-level alignment. With an improved Viterbi algorithm for post-processing, our method significantly reduces misalignments. Experimental results show that our approach outperforms existing methods, achieving a 6% accuracy improvement at the word level and a 27% improvement at the phoneme level on the LRS2 dataset. These improvements offer new potential for applications such as automatically subtitling TV shows or content on user-generated platforms like TikTok and YouTube Shorts.
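The Viterbi post-processing idea can be illustrated with a plain monotonic alignment over frame-level scores. This sketch is the textbook algorithm, not the paper's improved variant: it ignores the boundary and silence-aware cues, and the names `viterbi_align` and `to_segments` are hypothetical. It assumes each frame belongs to exactly one text unit (word or phoneme) and that units appear in order.

```python
import numpy as np


def viterbi_align(log_probs):
    """Best monotonic frame-to-unit path.

    log_probs[t, s] is the log-score of frame t belonging to text unit s.
    The path starts at unit 0, ends at the last unit, and at each frame
    either stays on the current unit or advances to the next one.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    num_frames, num_units = log_probs.shape
    if num_frames < num_units:
        raise ValueError("need at least one frame per text unit")

    score = np.full((num_frames, num_units), -np.inf)
    advanced = np.zeros((num_frames, num_units), dtype=int)  # 1 if came from s-1
    score[0, 0] = log_probs[0, 0]

    for t in range(1, num_frames):
        for s in range(min(t + 1, num_units)):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            if move > stay:
                score[t, s] = move + log_probs[t, s]
                advanced[t, s] = 1
            else:
                score[t, s] = stay + log_probs[t, s]

    # Backtrack from the last unit at the last frame.
    s = num_units - 1
    path = [s]
    for t in range(num_frames - 1, 0, -1):
        s -= int(advanced[t, s])
        path.append(s)
    return path[::-1]


def to_segments(path):
    """Collapse a frame path into (unit_index, start_frame, end_frame) tuples."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[t - 1]:
            segments.append((path[start], start, t - 1))
            start = t
    return segments


# Usage: four frames, two units; frames 0-1 favor unit 0, frames 2-3 unit 1.
probs = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
segments = to_segments(viterbi_align(probs))
```

Because the path can only stay or advance, isolated frame-level misclassifications cannot reorder or skip units, which is the kind of sequence consistency the post-processing step is meant to enforce.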

Paper Structure

This paper contains 11 sections, 2 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Architecture of the proposed method.
  • Figure 2: Visualization of the alignment result.