STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing

Zijun Ding; Mingdie Xiong; Congcong Zhu; Jingrun Chen

TL;DR

STSA tackles semantic ambiguity between the spatial and temporal domains in audio-driven visual dubbing by introducing a dual-path alignment framework and differentiable probabilistic heatmap guidance. The Consistent Information Learning (CIL) module maximizes mutual information across scales to align spatial-temporal semantics, while the heatmap guidance provides ambiguity-tolerant cues that enable end-to-end optimization. Empirical results on LRS2 and CMLR show that STSA improves image quality and synthesis stability and generalizes well across domains compared to state-of-the-art baselines. The work indicates that explicit spatial-temporal semantic alignment and differentiable guidance are key to more realistic and stable visual dubbing in practical applications.

Abstract

Existing audio-driven visual dubbing methods have achieved great success. Despite this, we observe that semantic ambiguity between the spatial and temporal domains significantly degrades synthesis stability for dynamic faces. We argue that aligning the semantic features from the spatial and temporal domains is a promising approach to stabilizing facial motion. To achieve this, we propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation. The former leverages a Consistent Information Learning (CIL) module to maximize mutual information at multiple scales, thereby reducing the manifold differences between the spatial and temporal domains. The latter uses a probabilistic heatmap as ambiguity-tolerant guidance to avoid abnormal dynamics in the synthesized faces caused by slight semantic jittering. Extensive experimental results demonstrate the superiority of the proposed STSA, especially in terms of image quality and synthesis stability. Pre-trained weights and inference code are available at https://github.com/SCAILab-USTC/STSA.
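
To make the idea of ambiguity-tolerant heatmap guidance concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes an isotropic Gaussian heatmap around a single landmark; the function names (`gaussian_heatmap`, `expected_coordinate`, `one_hot`, `l1_distance`), the grid size, and `sigma` are all hypothetical choices made for this example. It compares how a hard one-hot encoding and a probabilistic heatmap react to a 0.2-pixel jitter in the landmark position.

```python
import math


def gaussian_heatmap(cx, cy, width, height, sigma=1.5):
    """Grid of probabilities (summing to 1) centred on landmark (cx, cy).

    Assumption for this sketch: an isotropic Gaussian with a fixed sigma.
    """
    weights = [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def expected_coordinate(heatmap):
    """Soft-argmax: the probability-weighted mean position on the grid."""
    ex = sum(p * x for row in heatmap for x, p in enumerate(row))
    ey = sum(p * y for y, row in enumerate(heatmap) for p in row)
    return ex, ey


def one_hot(cx, cy, width, height):
    """Hard encoding: all mass on the pixel nearest to the landmark."""
    grid = [[0.0] * width for _ in range(height)]
    grid[round(cy)][round(cx)] = 1.0
    return grid


def l1_distance(a, b):
    """Sum of absolute differences between two grids of equal shape."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# A landmark jitters by 0.2 pixels between two consecutive frames.
size = 17
frame_a, frame_b = (8.4, 8.0), (8.6, 8.0)

hard_change = l1_distance(one_hot(*frame_a, size, size),
                          one_hot(*frame_b, size, size))
soft_change = l1_distance(gaussian_heatmap(*frame_a, size, size),
                          gaussian_heatmap(*frame_b, size, size))

print("one-hot change:", hard_change)
print("heatmap change:", soft_change)
```

In this toy setting the one-hot encoding jumps to a different pixel, while the heatmap changes only slightly and varies smoothly with the landmark position, which is the property that makes such guidance usable in gradient-based, end-to-end training.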
