SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu

TL;DR

This work proposes SegTalker, a framework that decouples lip movements from image textures by introducing segmentation as an intermediate representation. The approach preserves texture details and generates temporally consistent video while remaining competitive in lip synchronization.

Abstract

Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address these challenges, we propose a novel framework called SegTalker to decouple lip movements and image textures by introducing segmentation as an intermediate representation. Specifically, given the mask of an image produced by a parsing network, we first leverage the speech to drive the mask and generate a talking segmentation. Then we disentangle the semantic regions of the image into style codes using a mask-guided encoder. Finally, we inject the previously generated talking segmentation and the style codes into a mask-guided StyleGAN to synthesize each video frame. In this way, most textures are preserved. Moreover, our approach inherently achieves background separation and facilitates mask-guided local facial editing. In particular, by editing the mask and swapping region textures from a given reference image (e.g. hair, lip, eyebrows), our approach enables seamless facial editing while generating talking face video. Experiments demonstrate that our proposed approach can effectively preserve texture details and generate temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.
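
The region-wise disentanglement and texture swapping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a region's style code can be approximated by averaging a feature map over the pixels of that region's mask (masked average pooling), and the names region_style_codes, swap_region, and broadcast_codes are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def region_style_codes(features, mask, num_regions):
    """Masked average pooling: one style vector per semantic region.

    features: (H, W, C) float array, standing in for an encoder feature map.
    mask:     (H, W) int array of region labels in [0, num_regions).
    Returns a (num_regions, C) array; regions absent from the mask stay zero.
    """
    channels = features.shape[-1]
    codes = np.zeros((num_regions, channels), dtype=features.dtype)
    for r in range(num_regions):
        region = mask == r
        if region.any():
            # Boolean indexing gathers the (N, C) features of this region.
            codes[r] = features[region].mean(axis=0)
    return codes


def swap_region(codes, ref_codes, region_ids):
    """Return a copy of codes with the listed regions taken from ref_codes."""
    out = codes.copy()
    ids = list(region_ids)
    out[ids] = ref_codes[ids]
    return out


def broadcast_codes(codes, mask):
    """Paint every pixel with its region's style code, giving (H, W, C)."""
    return codes[mask]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 4, 8))       # toy source feature map
    ref_feats = rng.normal(size=(4, 4, 8))   # toy reference feature map
    mask = np.zeros((4, 4), dtype=int)
    mask[2:, :] = 1                          # label 1 = hypothetical "lip" region
    src = region_style_codes(feats, mask, num_regions=2)
    ref = region_style_codes(ref_feats, mask, num_regions=2)
    edited = swap_region(src, ref, [1])      # borrow region 1 from the reference
    print(broadcast_codes(edited, mask).shape)
```

Because each region has its own code, replacing one row changes only that region's texture, and driving the mask with speech changes only where each code is painted. That separation is the intuition behind decoupling lip movement from texture; the actual encoder and generator in the paper are learned networks, which this sketch does not model.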

Paper Structure (14 sections, 7 equations, 8 figures, 3 tables)


Figures (8)

  • Figure 1: Given a talking video and another speech, SegTalker can produce high-fidelity and synchronized video with rich textures (row 2), enabling swapping background (row 3) and local editing such as blinking (row 4).
  • Figure 2: Overview of the proposed SegTalker framework for talking face generation. (a) The talking segmentation generation (TSG) module takes mel and mask as inputs, then synthesizes the talking segmentation with lips synchronized to the input speech. (b) Given a reference image and the mask from TSG, the segmentation-guided GAN injection (SGI) network uses a mask-guided multi-scale encoder to extract style codes for the different semantic regions, then injects the style codes and the synthesized mask from TSG into the mask-guided generator to obtain the final talking face image.
  • Figure 3: Qualitative results on the MEAD dataset. Our model demonstrates superior overall performance, particularly in preserving fine-grained textures (e.g. teeth).
  • Figure 4: Qualitative comparisons of our results with several state-of-the-art methods for talking face synthesis. Our method produces high-fidelity video frames with rich textural details, while other methods struggle to preserve identity and produce artifacts. Note that AD-NeRF must be trained separately on each of the two identities to produce these results.
  • Figure 5: Visualization of synthesized segmentation (rows 1 and 3) and real images (rows 2 and 4).
  • ...and 3 more figures