Discrete Prior-based Temporal-coherent Content Prediction for Blind Face Video Restoration

Lianxin Xie, Bingbing Zheng, Wen Xue, Yunfei Zhang, Le Jiang, Ruotao Xu, Si Wu, Hau-San Wong

TL;DR

This work tackles blind face video restoration under unknown degradations by introducing DP-TempCoh, a transformer-based framework that leverages two discrete priors: a visual-prior latent bank for content synthesis and a motion-prior statistics bank for temporal coherence. Content prediction uses spatial-temporal context to map degraded frame tokens to high-quality content indices, while motion statistics modulation aligns predicted content with real video statistics through affine normalization guided by cross-frame priors. The model fuses these predictions via cross-attention and decodes them into high-quality, temporally stable video, trained with a combination of perceptual, adversarial, and bank-based supervision. Extensive experiments on synthetic and in-the-wild data show improvements in fidelity, identity preservation, and temporal coherence, outperforming state-of-the-art image- and video-restoration methods and validated by a user study. The approach opens avenues for applying discrete priors and statistic-guided modulation to a broader set of video restoration tasks demanding consistent, high-quality content across time.
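To make the idea of a discrete visual prior concrete, the following is a minimal sketch of mapping tokens to indices in a latent bank and reading back the corresponding entries. It is an illustration only: DP-TempCoh predicts the indices with a transformer that uses spatial-temporal context, whereas this sketch substitutes a plain nearest-neighbor search. The names `nearest_index`, `lookup_content`, `toy_bank`, and `toy_tokens` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(token: Vector, bank: Sequence[Vector]) -> int:
    """Index of the bank entry closest to the token.

    Assumption: nearest-neighbor search stands in for the learned
    index predictor described in the summary.
    """
    return min(range(len(bank)), key=lambda i: squared_distance(token, bank[i]))


def lookup_content(
    tokens: Sequence[Vector], bank: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map each token to a bank index, then replace it with that bank entry."""
    indices = [nearest_index(t, bank) for t in tokens]
    content = [list(bank[i]) for i in indices]
    return indices, content


# Toy bank with three 2-D entries and three noisy tokens (made-up values).
toy_bank = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
toy_tokens = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.7]]
toy_indices, toy_content = lookup_content(toy_tokens, toy_bank)
print(toy_indices)
```

The point of the discrete representation is that the output is always composed of bank entries, so a noisy input token cannot leak its degradation directly into the restored content.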

Abstract

Blind face video restoration aims to restore high-fidelity details from videos subjected to complex and unknown degradations. This task poses a significant challenge of managing temporal heterogeneity while at the same time maintaining stable face attributes. In this paper, we introduce a Discrete Prior-based Temporal-Coherent content prediction transformer to address the challenge, and our model is referred to as DP-TempCoh. Specifically, we incorporate a spatial-temporal-aware content prediction module to synthesize high-quality content from discrete visual priors, conditioned on degraded video tokens. To further enhance the temporal coherence of the predicted content, a motion statistics modulation module is designed to adjust the content, based on discrete motion priors in terms of cross-frame mean and variance. As a result, the statistics of the predicted content can match those of real videos over time. By performing extensive experiments, we verify the effectiveness of the design elements and demonstrate the superior performance of our DP-TempCoh in both synthetically and naturally degraded video restoration.
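The modulation by cross-frame mean and variance can be sketched as a normalize-then-rescale step. This is a sketch under stated assumptions, not the authors' implementation: it assumes the target statistics have already been retrieved from the motion prior, and it treats one token position as a list of per-frame feature vectors. The names `cross_frame_stats`, `modulate`, `toy_frames`, and the `eps` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cross_frame_stats(
    frames: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-channel mean and (population) variance across frames.

    `frames` has shape T x C: one feature vector per frame.
    """
    t = len(frames)
    channels = len(frames[0])
    mean = [sum(f[c] for f in frames) / t for c in range(channels)]
    var = [sum((f[c] - mean[c]) ** 2 for f in frames) / t for c in range(channels)]
    return mean, var


def modulate(
    frames: Sequence[Sequence[float]],
    target_mean: Sequence[float],
    target_var: Sequence[float],
    eps: float = 1e-5,
) -> List[List[float]]:
    """Normalize features across frames, then apply the target statistics.

    Assumption: target_mean and target_var come from the motion prior;
    how they are selected is outside the scope of this sketch.
    """
    mean, var = cross_frame_stats(frames)
    return [
        [
            (f[c] - mean[c]) / math.sqrt(var[c] + eps) * math.sqrt(target_var[c])
            + target_mean[c]
            for c in range(len(f))
        ]
        for f in frames
    ]


# Three frames, two channels (made-up values).
toy_frames = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
modulated = modulate(toy_frames, target_mean=[1.0, 5.0], target_var=[4.0, 1.0])
new_mean, new_var = cross_frame_stats(modulated)
print(new_mean, new_var)
```

After modulation the cross-frame mean equals the target mean, and the variance approaches the target variance wherever the input actually varies over time. A channel that is constant across frames stays constant, since there is no variation to rescale.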

Paper Structure (24 sections, 13 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: An example to visually compare the proposed DP-TempCoh with the competing image/video restoration methods: DiffBIR and FMA-Net, in restoration quality and temporal coherence.
  • Figure 2: Overview of the proposed DP-TempCoh framework. An encoder $E$ extracts the tokens $z$ from a degraded face video segment $v_{lq}$. A latent spatial-temporal-aware content prediction module is applied to $z$ to predict $z'$, which is enriched with spatial and temporal contextual information. Next, a prior-based motion statistics modulation module modulates the statistics of $z'$ to obtain $z''$. Several cross-attention-based transformer computations are then performed over $z'$ and $z''$, and the resulting feature is fed into a generator $G$ to synthesize a high-quality (HQ) face video $\hat{z}$.
  • Figure 3: Convergence comparison between spatial-temporal-aware (S & T-aware) and spatial-aware (S-aware) prediction loss.
  • Figure 4: Visualization of stable attention maps corresponding to the query of the left eye.
  • Figure 5: Visual comparison between DP-TempCoh and ablative models (defined in Table 1) on VFHQ-Test-Deg video.
  • ...and 3 more figures