VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement

Shulian Zhang, Yong Guo, Long Peng, Ziyang Wang, Ye Chen, Wenbo Li, Xiao Zhang, Yulun Zhang, Jian Chen

TL;DR

VividFace introduces a one-step diffusion framework for video face enhancement that replaces traditional multi-step sampling with a flow-matching approach built on the WANX video model, achieving roughly a $12\times$ speedup without sacrificing fidelity. It couples a Joint Latent-Pixel Face-Focused Training strategy with spatiotemporally aligned facial masks and a two-stage optimization to sharpen facial details while preserving global video quality. An automated, MLLM-driven pipeline curates a high-quality dataset, MLLM-Face90, which improves facial texture learning and generalization. Comprehensive experiments on synthetic and real-world benchmarks show superior perceptual quality, identity preservation, and temporal consistency; code, models, and the dataset will be publicly released.
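To make the one-step idea concrete, here is a minimal sketch, not the authors' implementation. It assumes flow matching is sampled by Euler integration of a velocity field from t=0 to t=1. The names `euler_sample`, `toy_velocity`, `Z_LQ`, and `Z_HQ` are hypothetical, and the toy velocity stands in for a trained network.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(z_start: Vector, velocity: VelocityFn, num_steps: int) -> Vector:
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with equal Euler steps."""
    z = list(z_start)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity(z, i * dt)  # one model evaluation per step
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy stand-ins for a degraded latent and its clean counterpart (made-up values).
Z_LQ: Vector = [0.2, -1.0, 0.5]
Z_HQ: Vector = [1.0, 0.0, -0.5]


def toy_velocity(z: Vector, t: float) -> Vector:
    """Hypothetical stand-in for a trained network: a constant velocity
    along the straight path from Z_LQ to Z_HQ."""
    return [hq - lq for hq, lq in zip(Z_HQ, Z_LQ)]


one_step = euler_sample(Z_LQ, toy_velocity, num_steps=1)     # one model call
multi_step = euler_sample(Z_LQ, toy_velocity, num_steps=50)  # fifty model calls
```

Because the toy velocity is constant along a straight path, one Euler step reaches the same endpoint as fifty. A trained network only approximates this behavior, and the roughly $12\times$ speedup quoted above is the paper's measured figure, not something this sketch derives.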

Abstract

Video Face Enhancement (VFE) aims to restore high-quality facial regions from degraded video sequences, enabling a wide range of practical applications. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks continue to face three fundamental challenges: (1) computational inefficiency caused by iterative multi-step denoising in diffusion models; (2) faithfully modeling intricate facial textures while preserving temporal consistency; and (3) limited model generalization due to the lack of high-quality face video training data. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for VFE. Built upon the pretrained WANX video generation model, VividFace reformulates the traditional multi-step diffusion process as a single-step flow matching paradigm that directly maps degraded inputs to high-quality outputs with significantly reduced inference time. To enhance facial detail recovery, we introduce a Joint Latent-Pixel Face-Focused Training strategy that constructs spatiotemporally aligned facial masks to guide optimization toward critical facial regions in both latent and pixel spaces. Furthermore, we develop an MLLM-driven automated filtering pipeline that produces MLLM-Face90, a meticulously curated high-quality face video dataset, ensuring models learn from photorealistic facial textures. Extensive experiments demonstrate that VividFace achieves superior performance in perceptual quality, identity preservation, and temporal consistency across both synthetic and real-world benchmarks. We will publicly release our code, models, and dataset to support future research.
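The face-focused training idea can be illustrated with a mask-weighted loss applied in both latent and pixel space. This is a sketch under stated assumptions, not the paper's loss. The L1 form, the `face_weight` and `pixel_loss_weight` values, the average-pooling used to align the mask with the latent grid, and all function names are hypothetical. Only a single frame is handled, so temporal alignment is omitted.

```python
from typing import List

Grid = List[List[float]]


def face_weighted_l1(pred: Grid, target: Grid, face_mask: Grid,
                     face_weight: float = 2.0) -> float:
    """Weighted mean absolute error: weight is 1 outside the face mask
    and face_weight inside it (mask values are expected in [0, 1])."""
    weighted_sum = 0.0
    weight_total = 0.0
    for p_row, t_row, m_row in zip(pred, target, face_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = 1.0 + (face_weight - 1.0) * m
            weighted_sum += w * abs(p - t)
            weight_total += w
    return weighted_sum / weight_total


def downsample_mask(mask: Grid, factor: int) -> Grid:
    """Average-pool a pixel-space mask to a latent grid that is `factor`
    times smaller per side. Height and width must be divisible by factor."""
    h, w = len(mask), len(mask[0])
    return [
        [
            sum(mask[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            / (factor * factor)
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def joint_face_loss(pred_latent: Grid, target_latent: Grid,
                    pred_pixels: Grid, target_pixels: Grid,
                    pixel_mask: Grid, latent_factor: int,
                    pixel_loss_weight: float = 1.0) -> float:
    """Sum of the face-weighted loss in latent space and in pixel space."""
    latent_mask = downsample_mask(pixel_mask, latent_factor)
    latent_term = face_weighted_l1(pred_latent, target_latent, latent_mask)
    pixel_term = face_weighted_l1(pred_pixels, target_pixels, pixel_mask)
    return latent_term + pixel_loss_weight * pixel_term
```

With this weighting, an error of the same size costs more when it falls inside the face mask than when it falls in the background, which is how the optimization gets steered toward facial regions.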

Paper Structure

This paper contains 15 sections, 8 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The left side shows a visual comparison between VividFace and existing video face restoration methods, illustrating that VividFace produces highly realistic and visually pleasing human eyes. The right side compares model inference time, parameter count, and IDS performance across different methods. VividFace achieves the best performance and the fastest speed with a comparable parameter count.
  • Figure 2: Overview of our proposed VividFace training framework. VividFace is a one-step diffusion method built upon the powerful WANX video model. It adopts a two-stage design that integrates latent and pixel-space optimization, leveraging spatiotemporal priors and stochastic training to simultaneously enhance facial details and overall video quality.
  • Figure 3: Pipeline of the proposed MLLM-driven high-quality face video filtering. First, face regions are extracted and cropped to facilitate the model's focus on facial features. Next, a meticulously designed set of visual quality assessment prompts is utilized to evaluate each video from multiple quality perspectives using the powerful Qwen2.5-VL.
  • Figure 4: Visual comparison with existing methods on VFHQ-test. VividFace exhibits more realistic and visually pleasing facial details, and produces results that are closer to the ground truth.
  • Figure 5: Qualitative comparison on the real-world RFV-LQ dataset. The results highlight VividFace's strong capability to address complex real-world degradations.
  • ...and 2 more figures
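
The filtering pipeline described in the Figure 3 caption (crop the face, then score each video against a set of quality prompts) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline. The prompt texts, the 1-to-5 scale, the `min_score` threshold, and the keep-only-if-every-score-passes rule are all hypothetical. The MLLM call is represented by a caller-supplied function and replaced here by a stub, so no real model API is shown.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical prompts; the paper's actual prompt set is not reproduced here.
QUALITY_PROMPTS: Dict[str, str] = {
    "sharpness": "Rate the sharpness of the face from 1 to 5.",
    "texture": "Rate how natural the skin texture looks from 1 to 5.",
    "artifacts": "Rate the absence of compression artifacts from 1 to 5.",
}

# (face_clip_id, prompt) -> numeric score; in practice this would wrap an MLLM.
ScoreFn = Callable[[str, str], float]


def filter_face_clips(clip_ids: Iterable[str], score_fn: ScoreFn,
                      prompts: Dict[str, str] = QUALITY_PROMPTS,
                      min_score: float = 4.0) -> List[str]:
    """Keep a clip only if its score for every prompt reaches min_score."""
    kept: List[str] = []
    for clip_id in clip_ids:
        scores = [score_fn(clip_id, prompt) for prompt in prompts.values()]
        if min(scores) >= min_score:
            kept.append(clip_id)
    return kept


# Stub scorer with made-up values, standing in for the real model.
_FAKE_SCORES = {"clip_a": 5.0, "clip_b": 3.0}


def stub_score(clip_id: str, prompt: str) -> float:
    return _FAKE_SCORES[clip_id]


kept = filter_face_clips(["clip_a", "clip_b"], stub_score)
```

Requiring every prompt to pass is a conservative choice that favors dataset purity over size; averaging the scores would be a looser alternative.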