Kalman-Inspired Feature Propagation for Video Face Super-Resolution

Ruicheng Feng, Chongyi Li, Chen Change Loy

TL;DR

This work targets the dual challenges of facial detail fidelity and temporal coherence in video face super-resolution (VFSR). It introduces KEEP, a Kalman-inspired feature propagation framework that maintains a latent face prior over time by recurrently updating a latent state $z_t$ with information from previously restored frames, guided by a learned Kalman gain network. The method formulates a state-space model in latent space, using a CodeFormer-based generative backbone and a Kalman Filter Network to fuse predictive and observed information, with local temporal consistency enforced via cross-frame attention. Empirical results on VFHQ show that KEEP improves both fidelity (PSNR/SSIM/LPIPS) and temporal stability (IDS/AKD) over frame-by-frame image-based SR and standard VSR baselines, and that it remains robust under severe degradations and non-frontal views. Code and video demos are provided.
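The recurrent update described above can be illustrated with a minimal sketch of a Kalman-style fusion step. This is not the authors' implementation: the function names `fuse_states` and `propagate`, the elementwise gain, and the identity transition used in the demo are all assumptions chosen for illustration. In KEEP the gain comes from a learned network and the states are latent features of a generative face model, whereas here they are plain lists of floats.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def fuse_states(z_pred: Sequence[float], z_obs: Sequence[float],
                gain: Sequence[float]) -> Vector:
    """Kalman-style update: move the prediction toward the observation.

    A gain of 0 keeps the prediction, a gain of 1 keeps the observation.
    (Hypothetical elementwise form; the learned gain in the paper may differ.)
    """
    return [p + k * (o - p) for p, o, k in zip(z_pred, z_obs, gain)]


def propagate(observations: Sequence[Sequence[float]],
              predict: Callable[[Vector], Vector],
              gain_fn: Callable[[Vector, Sequence[float]], Vector]) -> List[Vector]:
    """Recurrently filter a sequence of observed latent codes."""
    z_hat = list(observations[0])  # first frame has no history to fuse with
    states = [z_hat]
    for z_obs in observations[1:]:
        z_pred = predict(z_hat)            # predict from the previous estimate
        gain = gain_fn(z_pred, z_obs)      # stand-in for a learned gain network
        z_hat = fuse_states(z_pred, z_obs, gain)
        states.append(z_hat)
    return states


if __name__ == "__main__":
    # Toy sequence: identity transition and a constant gain of 0.5.
    observed = [[0.0], [2.0], [2.0]]
    filtered = propagate(observed,
                         predict=lambda z: list(z),
                         gain_fn=lambda p, o: [0.5] * len(p))
    print(filtered)
```

A smaller gain trusts the propagated state more and so smooths the sequence over time, while a larger gain follows each frame's own observation more closely. This is the trade-off between temporal stability and per-frame fidelity that the summary describes.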

Abstract

Despite the promising progress of face image super-resolution, video face super-resolution remains relatively under-explored. Existing approaches either adapt general video super-resolution networks to face datasets or apply established face image super-resolution models independently on individual video frames. These paradigms encounter challenges either in reconstructing facial details or maintaining temporal consistency. To address these issues, we introduce a novel framework called Kalman-inspired Feature Propagation (KEEP), designed to maintain a stable face prior over time. The Kalman filtering principles offer our method a recurrent ability to use the information from previously restored frames to guide and regulate the restoration process of the current frame. Extensive experiments demonstrate the effectiveness of our method in capturing facial details consistently across video frames. Code and video demo are available at https://jnjaby.github.io/projects/KEEP.

Paper Structure (15 sections, 11 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Comparison of the main VFSR strategies. We show seven frames sampled at an interval of $6$. The generic VSR model BasicVSR [chan2021basicvsr] fails to reconstruct facial components faithfully. The single-image FSR model CodeFormer [zhou2022codeformer] hallucinates unnatural and inconsistent facial details. Our method, in contrast, restores low-quality face video consistently while preserving temporal coherence across frames.
  • Figure 2: (a) Graphical model of the state space. It defines the underlying dynamic system, where $f$ describes how the latent states $z_t$ transition over time, $g$ is a generative model, and $h$ models the degradation from the clean frame $y_t$ to the degraded frame $x_t$. (b) Block diagram of the Kalman filter model. At each time step, a predictive state from the previous frame $\hat{z}^+_{t-1}$ (blue dashed box) and the newly observed state of the current frame $x_t$ (red dashed box) are fused by the Kalman gain $\mathcal{K}_t$ from the Kalman Gain Network (KGN) to produce a more accurate estimate. The combined state $\hat{z}^+_t$ is then used to generate the estimated clean frame $\hat{y}_t$ via $g_{\theta}$. Note that $\tilde{z}_{1}$ accompanies $\tilde{z}_{t-1}$ as an anchor; it is omitted from the diagram for simplicity.
  • Figure 3: Overview of the proposed KEEP. It consists of four modules: an encoder $\mathcal{E}_L$, a decoder $\mathcal{D}_Q$, a Kalman filter network, and cross-frame attention (CFA). We illustrate the information flow in one time step.
  • Figure 4: Qualitative comparison on VFHQ. KEEP produces high-fidelity face videos with faithful and consistent details. See arrows for details.
  • Figure 5: Comparison of temporal flicker. We select one pixel column from each frame (red lines) and show how it changes over time. Image-based models (GFPGAN, CodeFormer, and RestoreFormer) show obvious discontinuities around the eyes and wrinkles, and general VSR methods leave artifacts behind. In contrast, by maintaining stable facial priors and aggregating temporal information, our method markedly suppresses temporal jitter and promotes coherent local details.
  • ...and 4 more figures
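
The temporal-profile view described for Figure 5 (one pixel column per frame, laid out along the time axis) can be sketched as follows. This is a minimal illustration, not the authors' visualization code: the names `temporal_profile` and `mean_temporal_change` are hypothetical, frames are assumed to be grayscale nested lists indexed as `frame[row][col]`, and the change score is an ad hoc proxy rather than any metric reported in the paper.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # grayscale frame indexed as frame[row][col]


def temporal_profile(frames: Sequence[Frame], col: int) -> List[List[float]]:
    """Stack the same pixel column from every frame; result is indexed [row][time]."""
    height = len(frames[0])
    return [[frame[row][col] for frame in frames] for row in range(height)]


def mean_temporal_change(profile: Sequence[Sequence[float]]) -> float:
    """Average absolute change between consecutive time steps.

    Hypothetical flicker proxy for illustration only.
    """
    diffs = [abs(b - a) for row in profile for a, b in zip(row, row[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


if __name__ == "__main__":
    # Three 2x2 frames; the top-right pixel flickers, everything else is static.
    frames = [
        [[0.0, 1.0], [2.0, 3.0]],
        [[0.0, 5.0], [2.0, 3.0]],
        [[0.0, 1.0], [2.0, 3.0]],
    ]
    profile = temporal_profile(frames, col=1)
    print(profile)
    print(mean_temporal_change(profile))
```

In such a profile, a temporally stable restoration appears as smooth horizontal streaks, while frame-to-frame flicker shows up as abrupt changes along the time axis.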