PersonaTalk: Bring Attention to Your Persona in Visual Dubbing

Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu

TL;DR

PersonaTalk addresses the problem of preserving a speaker's persona in audio-driven visual dubbing while keeping lip-sync accurate. It is a two-stage framework. Style-Aware Geometry Construction uses a hybrid 3D geometry representation and cross-attention-based style injection to drive lip-synced geometries from stylized audio features. Dual-Attention Face Rendering then uses Lip-Attention and Face-Attention to render textures for the target geometries, sampling from separately selected lip and face reference frames. The authors report better visual quality, lip-sync accuracy, and persona preservation than state-of-the-art person-generic methods, and performance close to person-specific baselines without speaker-specific fine-tuning. Using 3D geometry as the intermediate representation, together with dual-attention texture sampling, lets the method carry over speaking style and fine facial detail, which suggests practical use in multilingual dubbing and digital-human applications. Limitations include possible artifacts under extreme head poses and with non-human avatars; for ethical reasons, access to the core models is restricted.

Abstract

For audio-driven visual dubbing, it remains a considerable challenge to uphold and highlight the speaker's persona while synthesizing accurate lip synchronization. Existing methods fall short of capturing the speaker's unique speaking style or preserving facial details. In this paper, we present PersonaTalk, an attention-based two-stage framework, comprising geometry construction and face rendering, for high-fidelity and personalized visual dubbing. In the first stage, we propose a style-aware audio encoding module that injects speaking style into audio features through a cross-attention layer. The stylized audio features are then used to drive the speaker's template geometry to obtain lip-synced geometries. In the second stage, a dual-attention face renderer is introduced to render textures for the target geometries. It consists of two parallel cross-attention layers, namely Lip-Attention and Face-Attention, which respectively sample textures from different reference frames to render the entire face. With this design, intricate facial details can be well preserved. Comprehensive experiments and user studies demonstrate our advantages over other state-of-the-art methods in terms of visual quality, lip-sync accuracy and persona preservation. Furthermore, as a person-generic framework, PersonaTalk achieves performance competitive with state-of-the-art person-specific methods. Project Page: https://grisoon.github.io/PersonaTalk/.
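
The two attention mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain scaled dot-product cross-attention with no learned projections, and the function names (`inject_style`, `dual_attention_render`), the residual addition in stage one, and the per-query `lip_mask` blend in stage two are all assumptions made for illustration. The abstract only states that style is injected through a cross-attention layer and that Lip-Attention and Face-Attention sample textures from different reference frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention on lists of vectors.

    Each query attends over all keys and returns the weighted sum of values.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def inject_style(audio_feats, style_tokens):
    """Stage 1 sketch: audio features query the speaker's style tokens.

    The residual addition is an assumption, not taken from the paper.
    """
    attended = cross_attention(audio_feats, style_tokens, style_tokens)
    return [[a + s for a, s in zip(feat, att)]
            for feat, att in zip(audio_feats, attended)]


def dual_attention_render(geometry_queries, lip_refs, face_refs, lip_mask):
    """Stage 2 sketch: two parallel cross-attention branches.

    lip_refs and face_refs are (keys, values) pairs built from different
    reference frames. Blending the branches with a per-query lip_mask in
    [0, 1] is a hypothetical merge rule chosen for this example.
    """
    lip_tex = cross_attention(geometry_queries, lip_refs[0], lip_refs[1])
    face_tex = cross_attention(geometry_queries, face_refs[0], face_refs[1])
    return [[m * l + (1.0 - m) * f for l, f in zip(lip_row, face_row)]
            for m, lip_row, face_row in zip(lip_mask, lip_tex, face_tex)]
```

In this sketch the target geometry supplies the queries in stage two, so each rendered location pulls texture from whichever reference content it matches best. Keeping the lip and face branches separate means each can draw on reference frames chosen for that region.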

Paper Structure (12 sections, 4 equations, 4 figures, 4 tables, 1 algorithm)

Figures (4)

  • Figure 1: Our proposed method is an attention-based two-stage framework consisting of geometry construction and face rendering. Note that the frames in the green box are selected as the lip-reference, while the frames in the blue box are selected as the face-reference.
  • Figure 2: Qualitative comparisons with Wav2Lip, VideoRetalking, DINet and IP_LAP. The top row is the input (reference) video; the second row shows the target lip movements for the target audio. Our method not only generates accurate lip movements, but also preserves the speaker's speaking style and facial details.
  • Figure 3: Self-dubbing results compared with a SOTA person-specific method.
  • Figure 4: The t-SNE visualization of our extracted style embeddings. Points of different colors indicate different speakers. "GT" denotes original videos and "Generated" denotes data generated by our method.