Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, Chen Change Loy

TL;DR

Upscale-A-Video tackles real-world VSR by coupling a local-global temporal strategy with a latent diffusion prior. Local consistency is achieved by finetuning temporal U-Net and VAE-Decoder components, while global coherence is enforced via a training-free flow-guided recurrent latent propagation module that operates across video segments. The framework supports text-guided texture synthesis and adjustable noise levels to balance restoration and generation, offering a flexible fidelity-quality trade-off. Empirical results on synthetic, real-world, and AI-generated videos demonstrate superior temporal stability and perceptual realism, establishing a strong practical baseline for diffusion-based VSR in real-world conditions.

Abstract

Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which are complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latents across the entire sequence. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.

Paper Structure (27 sections, 3 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: Video super-resolution comparisons on both real-world and AI-generated videos. Our proposed Upscale-A-Video showcases excellent upscaling capabilities. By using appropriate text prompts, it achieves impressive results characterized by finer details and heightened visual realism. (Zoom-in for best view)
  • Figure 2: An overview of Upscale-A-Video. Upscale-A-Video processes long videos using both local and global strategies to maintain temporal coherence. It divides the video into segments and processes them using a U-Net with temporal layers for intra-segment consistency. During user-specified diffusion steps for global refinement, a recurrent latent propagation module is used to enhance inter-segment consistency. Finally, a finetuned VAE-Decoder reduces remaining flickering artifacts for low-level consistency. Our model also allows users to guide texture creation with text prompts and adjust noise levels to balance the effect of restoration and generation.
  • Figure 3: An illustration of flow-guided recurrent latent propagation. Without requiring any learning, this module can achieve coherence across video segments via long-term latent propagation and aggregation. It relies on optical flow validity determined by forward-backward consistency error (Meister et al., 2018). Only latent positions with low consistency errors will be propagated, while those with high errors, marked with a red dot, are not.
  • Figure 4: Qualitative comparisons on synthetic low-quality videos from the REDS30 (Nah et al., 2019) and YouHQ40 datasets. Among the tested methods, only our Upscale-A-Video can recover the accurate wall structure and produce detailed koala fur. (Zoom-in for best view)
  • Figure 5: Qualitative comparisons on real-world test videos in the VideoLQ (Chan et al., 2022) dataset. Our Upscale-A-Video effectively leverages the advantages of the diffusion paradigm in generating high-quality results. When compared to existing methods, it notably excels in its restoration capabilities, successfully recovering the billboard text "EAT IN or TAKEAWAY". In particular, when guided by text prompts, Upscale-A-Video showcases promising enhanced results with more details and heightened realism. (Zoom-in for best view)
  • ...and 11 more figures
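
The forward-backward consistency check described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the nearest-neighbour warp, the error threshold, and the blending weight `beta` are all assumptions made for the example, and the flows are taken as given arrays rather than produced by any particular flow estimator.

```python
import numpy as np

def warp_nearest(field, flow):
    """Sample `field` (H, W, C) at positions displaced by `flow` (H, W, 2).

    flow[y, x] = (dx, dy) points from pixel (x, y) in the current frame to
    its match in the frame that `field` belongs to. Nearest-neighbour
    sampling with border clamping keeps the sketch short (an assumption).
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return field[src_y, src_x]

def fb_consistency_mask(flow_cur_to_prev, flow_prev_to_cur, threshold=1.0):
    """True where following the flow to the previous frame and back lands near the start."""
    # Backward flow sampled at the location each current pixel maps to.
    back = warp_nearest(flow_prev_to_cur, flow_cur_to_prev)
    # For a consistent pair the two displacements cancel out.
    error = np.linalg.norm(flow_cur_to_prev + back, axis=-1)
    return error < threshold

def propagate_latent(prev_latent, cur_latent, flow_cur_to_prev,
                     flow_prev_to_cur, beta=0.5, threshold=1.0):
    """Blend the warped previous latent into the current one at valid positions only."""
    valid = fb_consistency_mask(flow_cur_to_prev, flow_prev_to_cur, threshold)
    warped_prev = warp_nearest(prev_latent, flow_cur_to_prev)
    fused = beta * warped_prev + (1.0 - beta) * cur_latent
    # Positions with a high consistency error keep the current latent unchanged.
    return np.where(valid[..., None], fused, cur_latent)
```

Applying `propagate_latent` frame by frame, with each fused latent serving as `prev_latent` for the next frame, gives the recurrent behaviour the caption describes: information travels along reliable flow trajectories, while positions with unreliable flow (for example, occlusions) are left untouched.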