Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment

Zhihao Zhan, Wang Pang, Xiang Zhu, Yechao Bai

TL;DR

Video super-resolution is reformulated as an inverse problem and solved with Diffusion Posterior Sampling (DPS), using an unconditional video diffusion transformer (DiT) that operates in latent space as the prior. A VAE maps HR videos to a latent Z; during sampling, the current latent estimate is decoded, passed through a differentiable degradation operator H, and compared with the LR observations, and the resulting discrepancy steers the DiT denoising steps toward high-fidelity frames without explicit motion alignment. The key contribution is latent-space diffusion for 3D video priors combined with frame-degradation consistency, which enables alignment-free VSR that adapts to varying sampling conditions without retraining. Results on synthetic Moving MNIST and real BAIR data indicate that inter-frame information improves restoration, with substantial gains over motion-estimation baselines and increasing robustness to aliasing as the number of observed frames grows.
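To make the DPS idea concrete, here is a minimal sketch of a guided reverse-diffusion loop on a toy 1D signal. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the learned DiT prior is replaced by an analytic standard-Gaussian prior (for which the posterior mean is available in closed form), there is no VAE, the degradation H is 2x pair averaging, and the guidance step `zeta` is a constant. The names `degrade`, `degrade_adjoint`, and `dps_sample` are hypothetical.

```python
import math
import random


def degrade(x):
    """Toy degradation H: 2x down-sampling by averaging neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


def degrade_adjoint(r):
    """Adjoint H^T of the pair-averaging operator above."""
    out = []
    for v in r:
        out.extend([v / 2.0, v / 2.0])
    return out


def dps_sample(y, n, num_steps=200, zeta=0.5, seed=0):
    """DPS-style sampling: DDPM reverse step plus a data-consistency gradient step.

    Assumption: the prior is x0 ~ N(0, I), so E[x0 | x_t] = sqrt(abar_t) * x_t.
    In the paper's setting this estimate would come from the diffusion model.
    """
    rng = random.Random(seed)
    betas = [1e-4 + (0.02 - 1e-4) * t / (num_steps - 1) for t in range(num_steps)]
    abar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar.append(prod)

    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # start from pure noise
    for t in reversed(range(num_steps)):
        a_t = abar[t]
        a_prev = abar[t - 1] if t > 0 else 1.0

        # Denoised estimate of the clean signal (analytic stand-in for the model).
        x0_hat = [math.sqrt(a_t) * v for v in x]

        # Unconditional DDPM posterior step (no noise on the final step).
        c1 = math.sqrt(1.0 - betas[t]) * (1.0 - a_prev) / (1.0 - a_t)
        c2 = math.sqrt(a_prev) * betas[t] / (1.0 - a_t)
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x_prev = [c1 * xi + c2 * x0i + sigma * rng.gauss(0.0, 1.0)
                  for xi, x0i in zip(x, x0_hat)]

        # Gradient of ||y - H(x0_hat)||^2 with respect to x_t.
        # Here it is written out by hand; with a neural prior it would
        # be obtained by automatic differentiation.
        residual = [yi - hi for yi, hi in zip(y, degrade(x0_hat))]
        grad = [-2.0 * math.sqrt(a_t) * g for g in degrade_adjoint(residual)]

        # Data-consistency correction.
        x = [xp - zeta * g for xp, g in zip(x_prev, grad)]
    return x


if __name__ == "__main__":
    y = degrade([1.0, 1.0, -1.0, -1.0])  # low-resolution observation
    x = dps_sample(y, n=4)
    print("observation:", y)
    print("degraded sample:", degrade(x))
```

The sample is only constrained through H, so the degraded output should match `y` closely while the individual high-resolution values remain governed by the prior. That is the role the video diffusion prior plays in the method: it supplies the detail that the observations alone cannot determine.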

Abstract

In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.

Paper Structure

This paper contains 10 sections, 15 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the Video Diffusion Model based VSR (VDM-VSR): In the core iteration, the estimated 3D HR video resides in the latent space, represented by green boxes. It is generated and refined by the VDM, which includes several Transformer blocks, as shown in the structural diagram on the right. The latent video is then decoded and compared with the LR observations through the degradation model, indicated by red boxes. The discrepancies between these observations and the latent video are used to correct and enhance the HR video during the iteration. Upon completion of this iteration, the latent video is decoded back to the conventional HR space.
  • Figure 2: A 64x64x10 Moving MNIST frame sequence, its 8x down-sampled version, and the super-resolved results using different numbers of frames.
  • Figure 3: PSNR vs. number of used frames. The number of input frames gradually increases, and the PSNRs of the 1st frame are recorded. Each PSNR value in this plot is an average over 8 reference videos, and each video is restored 10 times with different noise instances.
  • Figure 4: A Moving MNIST frame sequence and its 8x down-sampled versions with blur kernel $\sigma_h = 0$ and $10$ pixels.
  • Figure 5: PSNR vs. number of used frames. LR inputs are generated with blur kernel $\sigma_h = 0, 2, 4, 6, 8, 10$.
  • ...and 2 more figures
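
The PSNR curves in Figures 3 and 5 report averages over several reference videos and repeated restorations. A minimal sketch of that bookkeeping follows. It assumes pixel values in [0, 1] (peak value 1.0), frames flattened to lists of floats, and that the average is taken over per-restoration PSNR values in dB; the page does not state any of these, and the function names `psnr` and `mean_psnr` are hypothetical.

```python
import math


def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    mse = sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)
    if mse == 0.0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


def mean_psnr(refs, restorations, peak=1.0):
    """Average PSNR over reference frames and repeated restorations.

    refs[i] is the i-th reference frame; restorations[i] is a list of
    restored versions of it (for example, one per noise instance).
    """
    values = [psnr(ref, est, peak)
              for ref, runs in zip(refs, restorations)
              for est in runs]
    return sum(values) / len(values)


if __name__ == "__main__":
    ref = [0.5] * 4
    print(psnr(ref, [0.6] * 4))
    print(mean_psnr([ref], [[[0.6] * 4, [0.4] * 4]]))
```

For the setting described in the Figure 3 caption, `refs` would hold the first frame of each of the 8 reference videos and each entry of `restorations` would hold the 10 restored first frames for that video.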