Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer

Kepeng Xu, Li Xu, Gang He, Wenxin Yu, Yunsong Li

TL;DR

This work tackles blind video face restoration without pre-alignment by introducing PGTFormer, a parsing-guided temporal-coherent transformer. It combines a temporal-spatial VQGAN (TS-VQGAN) to learn high-quality face priors, a temporal parse-guided codebook predictor (TPCP) that leverages face parsing as position encoding, and a temporal fidelity regulator (TFR) to enforce temporal consistency. The method outperforms state-of-the-art image and video restoration baselines on the VFHQ dataset, with strong quantitative gains in PSNR/SSIM/LPIPS and improved temporal metrics, while also reducing inference time by removing pre-alignment steps. The results indicate robust restoration across poses and degraded inputs, enabling natural, artifact-free video face sequences suitable for practical deployment. The authors also provide extensive ablations to justify each component and release code for reproducibility.

Abstract

Multiple complex degradations are coupled in low-quality video faces in the real world. Therefore, blind video face restoration is a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Restoring each frame independently in a naive manner inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) without pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent artifact-free results. Specifically, we pre-train a temporal-spatial vector quantized auto-encoder on high-quality video face datasets to extract expressive context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released at https://github.com/kepengxu/PGTFormer.
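The abstract's "vector quantized auto-encoder" relies on a codebook lookup: each latent feature is replaced by its nearest codebook entry. The sketch below illustrates only that generic nearest-neighbour step in plain Python. The function names, the 2-D toy vectors and the three-entry codebook are illustrative assumptions, not taken from the paper or its code.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map every feature to the index and value of its nearest codebook entry."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for feature in features:
        # Pick the codebook index with the smallest distance to this feature.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy data (hypothetical): three codebook entries and three latent features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.7]]

indices, quantized = quantize(features, codebook)
print(indices)    # nearest codebook index per feature
print(quantized)  # features replaced by their codebook entries
```

In a trained model the codebook entries are learned from high-quality data, so replacing a degraded feature by a codebook entry is what injects the high-quality prior.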

Paper Structure (20 sections, 8 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: (a) A low-quality face input that needs to be restored. (b) Face geometry-based methods perceive facial structure and can avoid wrong structures, but their textures are blurry. (c) Methods based on generative priors can generate detailed textures. However, due to the positional bias of the face structure, artifacts occur when the input face is not a standard aligned face. (d) Our method yields more natural face components in regions prone to artifacts.
  • Figure 2: (a) GAN prior method with pre-alignment. In video face restoration, landmark detection on low-quality faces will produce errors, and inter-frame errors will lead to discontinuous restoration results. (b) Our end-to-end framework. The proposed time-parsing guided Transformer does not require pre-alignment.
  • Figure 3: The workflow of the parsing-guided temporal-coherent transformer (PGTFormer). (a) We first learn a temporal-spatial vector-quantized auto-encoder (TS-VQGAN) so that the codebook and decoder can represent high-quality face video sequences. (b) The low-quality video sequence is fed into the low-quality face encoder $E_l$ to obtain the low-quality face latent features $z_l$, and into the Face Parsing Module $E_p$ to obtain the face parsing features $x_p$. Then $x_p$ and $x_l$ are input into the temporal parsing-guided codebook predictor (TPCP) to predict high-quality video face features. Finally, the low-quality feature $x_e$ is fused with the high-quality feature $x_d$ in the face decoder $D_h$. Specifically, we design a temporal fidelity regulator (TFR) to improve the temporal coherence of the face video. The weights of $D_h$ are pre-trained in (a) and fixed in (b).
  • Figure 4: Qualitative comparison. We show the results of aligned and non-aligned face videos. We can see that our method has fewer artifacts and can restore face information more naturally.
  • Figure 5: Previous methods generate artifacts when the input face is in a non-standard pose.
  • ...and 1 more figures
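
The Figure 3 caption describes parsing features and low-quality features jointly driving a codebook predictor. A heavily simplified sketch of that idea follows, assuming the parsing features are simply added to the low-quality tokens as a position-encoding-like term, and that a dot-product score stands in for the transformer. Both choices, all function names and the toy values are hypothetical and are not the authors' implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def add_parsing_guidance(
    lq_tokens: Sequence[Vector], parsing_tokens: Sequence[Vector]
) -> List[List[float]]:
    """Add parsing features to low-quality tokens element-wise (assumed fusion)."""
    return [
        [a + b for a, b in zip(token, parse)]
        for token, parse in zip(lq_tokens, parsing_tokens)
    ]


def predict_code_indices(
    tokens: Sequence[Vector], codebook: Sequence[Vector]
) -> List[int]:
    """Stand-in predictor: score each code by dot product and take the argmax."""
    indices: List[int] = []
    for token in tokens:
        scores = [sum(t * c for t, c in zip(token, code)) for code in codebook]
        indices.append(max(range(len(scores)), key=scores.__getitem__))
    return indices


def lookup(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Fetch the codebook entries selected by the predicted indices."""
    return [list(codebook[i]) for i in indices]


# Toy data (hypothetical): two codebook entries, two tokens per sequence.
codebook = [[1.0, 0.0], [0.0, 1.0]]
lq_tokens = [[0.4, 0.5], [0.6, 0.1]]
parsing_tokens = [[0.3, 0.0], [0.0, 0.7]]

unguided = predict_code_indices(lq_tokens, codebook)
guided = predict_code_indices(add_parsing_guidance(lq_tokens, parsing_tokens), codebook)
restored = lookup(guided, codebook)

print(unguided)  # indices chosen from low-quality tokens alone
print(guided)    # indices chosen once parsing guidance is added
```

In this toy setup the parsing term changes which codebook entries are selected, which is the role the caption assigns to parsing guidance. The real predictor is a learned temporal transformer rather than a fixed dot product.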