Adaptive Super Resolution For One-Shot Talking-Head Generation

Luchuan Song, Pinxin Liu, Guojun Yin, Chenliang Xu

TL;DR

This work proposes an adaptive, high-quality talking-head video generation method that synthesizes high-resolution video without additional pre-trained modules. Inspired by existing super-resolution methods, it down-samples the one-shot source image and then adaptively reconstructs high-frequency details via an encoder-decoder module, resulting in enhanced video clarity.

Abstract

One-shot talking-head generation learns to synthesize a talking-head video from a single source portrait image, driven by a video of the same or a different identity. These methods usually require plane-based pixel transformations via Jacobian matrices or facial image warps to generate novel poses. The constraints of a single source image and pixel displacements often compromise the clarity of the synthesized images. Some methods try to improve the quality of synthesized videos by introducing additional super-resolution modules, but this increases computational cost and disrupts the original data distribution. In this work, we propose an adaptive, high-quality talking-head video generation method that synthesizes high-resolution video without additional pre-trained modules. Specifically, inspired by existing super-resolution methods, we down-sample the one-shot source image and then adaptively reconstruct high-frequency details via an encoder-decoder module, resulting in enhanced video clarity. Our method consistently improves the quality of generated videos through a straightforward yet effective strategy, as substantiated by quantitative and qualitative evaluations. The code and demo video are available at https://github.com/Songluchuan/AdaSR-TalkingHead/.
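
To make the down-sampling idea concrete, here is a minimal sketch of what a reconstruction module would have to recover. It is not the authors' implementation: average pooling, nearest-neighbour up-sampling, and all function names (downsample, upsample, high_frequency_residual, make_training_pair) are assumptions chosen for illustration, and the image is a plain 2D list of grayscale values. The residual between the sharp image and its degraded copy is the high-frequency detail that down-sampling discards.

```python
def downsample(img, factor):
    """Average-pool a 2D grayscale image (list of rows) by an integer factor."""
    h, w = len(img), len(img[0])
    assert h % factor == 0 and w % factor == 0, "size must be divisible by factor"
    out = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            block = [img[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def upsample(img, factor):
    """Nearest-neighbour up-sampling: repeat each pixel factor x factor times."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def high_frequency_residual(img, factor):
    """Detail lost by the down/up-sampling round trip (sharp minus degraded)."""
    degraded = upsample(downsample(img, factor), factor)
    return [[p - q for p, q in zip(r_sharp, r_deg)]
            for r_sharp, r_deg in zip(img, degraded)]


def make_training_pair(img, factor=2):
    """Hypothetical helper: (degraded input, sharp target) for a learned model."""
    return upsample(downsample(img, factor), factor), img


image = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [1, 1, 8, 0],
    [1, 1, 0, 8],
]
degraded, target = make_training_pair(image, factor=2)
residual = high_frequency_residual(image, factor=2)
```

In this toy setting, flat regions give a zero residual while textured regions give a non-zero one, which is why a fixed interpolation alone cannot restore sharpness. A learned encoder-decoder, as described in the abstract, would be trained to predict such detail from the degraded input.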

Paper Structure (13 sections, 1 equation, 3 figures, 1 table)


Figures (3)

  • Figure 1: Illustration of the whole pipeline. (a) We apply pretrained, frozen modules (snowflake icon) to obtain images of different quality. (b) The pipeline of our framework; the flame icon marks modules that take part in training. Note that images with borders of different colors (green and red) form a set of training pairs.
  • Figure 2: Qualitative comparison with the baseline methods on videos from the HDTF dataset (zhang2021flow). The left part shows same-identity results, while the right part shows cross-identity results. We zoom in on the facial details to the left of each result. A red arrow indicates incorrect head posture, and the ground truth is at the top. We highly recommend watching our video at https://www.youtube.com/watch?v=B_-3F51QmKE&t=1s for more comparisons.
  • Figure 3: Visualization of the features from each generator layer with and without the adaptive high-frequency encoder $E$. Feature-Layer-1 and 2 are the features before deformation; Feature-Layer-3 and 4 are the features after deformation.