ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu

TL;DR

ReSyncer is a unified framework for audio-visual lip-syncing and virtual performer creation that rewires a Style-based generator to use 3D facial dynamics predicted by a Style-SyncFormer. The method uses roughly fitted 3D meshes as intermediate guidance and fuses mesh, texture, and identity cues via an attached reference frame, so that a single model handles high-fidelity lip-sync, speaking-style transfer, and face-swapping. The paper reports superior lip-sync metrics and competitive face-swapping performance on standard benchmarks, along with fast personalization and video-driven, cross-identity capabilities. The authors also note ethical considerations and limitations in extreme poses.

Abstract

Lip-syncing videos with given audio is the foundation for various applications, including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework, ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.

Paper Structure (21 sections, 3 equations, 13 figures, 9 tables)

Figures (13)

  • Figure 1: Lip-Synced/Speaking-Style-Transferred/Face-Swapping Results by ReSyncer. Our method not only produces high-fidelity lip-synced video according to audio but can further transfer the speaking style and identity of any target person.
  • Figure 2: The ReSyncer Framework. In the first stage, the Style-SyncFormer takes the style template and audio input to predict 3D facial dynamics. Then the predicted mesh overlays on the target frame to provide strong spatial guidance. The Style-based generator G processes the overlay and reference frames to produce the final result.
  • Figure 3: Face-Swapping Pipeline. With input data reconfiguration and additional training losses, we achieve lip-syncing and face-swapping simultaneously.
  • Figure 4: Qualitative Cross-Sync Results. The top row shows the lip-synced videos of the driving audio. Generation results based on the "Template" row should have the same lip shape as the "Lip-Synced Video" in the first row.
  • Figure 5: Qualitative Results of Face-Swap. Identity-swapped results should preserve the expression and lip motion of templates.
  • ...and 8 more figures
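
The two-stage data flow described in the Figure 2 caption can be sketched as a toy example. This is only an illustration of the order of operations (predict a mesh, overlay it on the target frame, pass the overlay and a reference frame to a generator), not the authors' implementation. All names here (`predict_mesh`, `overlay_mesh`, `generate`, `resync_frame`) are hypothetical, the "audio feature" is reduced to a single number, and the learned networks are replaced by trivial arithmetic stand-ins.

```python
from typing import List, Tuple

Frame = List[List[float]]         # H x W grayscale image with values in [0, 1]
Mesh = List[Tuple[float, float]]  # projected 2D vertex positions (x, y) in pixels


def predict_mesh(style_template: Mesh, audio_feature: float) -> Mesh:
    """Hypothetical stand-in for stage one (the Style-SyncFormer).

    Assumption: a real model would predict 3D facial dynamics; here each
    template vertex is simply shifted vertically by the audio feature.
    """
    return [(x, y + audio_feature) for x, y in style_template]


def overlay_mesh(target: Frame, mesh: Mesh) -> Frame:
    """Mark each in-bounds mesh vertex on a copy of the target frame."""
    out = [row[:] for row in target]
    height, width = len(out), len(out[0])
    for x, y in mesh:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            out[row][col] = 1.0
    return out


def generate(overlay: Frame, reference: Frame) -> Frame:
    """Hypothetical stand-in for the generator G.

    Assumption: a real generator is a learned network; this placeholder
    just averages the overlay and the reference frame pixel by pixel.
    """
    return [
        [0.5 * (o + r) for o, r in zip(overlay_row, reference_row)]
        for overlay_row, reference_row in zip(overlay, reference)
    ]


def resync_frame(
    style_template: Mesh, audio_feature: float, target: Frame, reference: Frame
) -> Frame:
    """Run the toy pipeline: mesh prediction, overlay, then generation."""
    mesh = predict_mesh(style_template, audio_feature)
    overlay = overlay_mesh(target, mesh)
    return generate(overlay, reference)


if __name__ == "__main__":
    blank = [[0.0] * 4 for _ in range(4)]
    result = resync_frame([(1.0, 1.0)], 1.0, blank, blank)
    for row in result:
        print(row)
```

The point of the sketch is the interface between stages: the generator never sees the audio directly, only the mesh overlay that spatially encodes the predicted motion, together with a reference frame that carries appearance.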