STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai

TL;DR

STAR addresses real-world video super-resolution by integrating text-to-video diffusion priors with a dedicated Local Information Enhancement Module and a Dynamic Frequency Loss. The LIEM injects local detail processing before global attention to reduce artifacts, while the DF Loss decouples fidelity across diffusion steps by emphasizing low-frequency structure early and high-frequency details later, improving both spatial fidelity and temporal consistency. Empirical results on synthetic and real-world datasets show STAR achieving state-of-the-art performance in key metrics, with additional gains when scaling to larger T2V models like CogVideoX. This work demonstrates the practical viability of powerful T2V priors for real-world VSR and suggests further improvements with larger diffusion priors and more diverse data.

Abstract

Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.
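To make the idea of a frequency-aware, step-dependent loss concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's DF Loss: the FFT-based split, the `cutoff` radius, the power-law schedule `low_freq_weight`, and the convention that large `t` corresponds to the early (noisy) diffusion steps are all hypothetical choices made for this example. The actual weighting function $c(t)$ is not given in this section.

```python
import numpy as np


def split_frequencies(frame: np.ndarray, cutoff: float = 0.25):
    """Split a 2-D grayscale frame into low- and high-frequency parts.

    `cutoff` is a hypothetical radius, expressed as a fraction of the
    Nyquist frequency (0.5 cycles per pixel).
    """
    h, w = frame.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies, cycles per pixel
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies
    low_mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff * 0.5
    low = np.fft.ifft2(np.fft.fft2(frame) * low_mask).real
    high = frame - low  # the two parts sum back to the input by construction
    return low, high


def low_freq_weight(t: int, num_steps: int, alpha: float = 2.0) -> float:
    """Placeholder schedule (an assumption, not the paper's c(t)).

    Returns 1.0 at t = num_steps - 1 (assumed noisiest, earliest step)
    and 0.0 at t = 0 (assumed final step).
    """
    return (t / (num_steps - 1)) ** alpha


def frequency_weighted_loss(pred, target, t, num_steps, alpha=2.0):
    """Blend low- and high-frequency MSE according to the diffusion step."""
    pred_low, pred_high = split_frequencies(pred)
    tgt_low, tgt_high = split_frequencies(target)
    w = low_freq_weight(t, num_steps, alpha)
    low_err = np.mean((pred_low - tgt_low) ** 2)
    high_err = np.mean((pred_high - tgt_high) ** 2)
    return w * low_err + (1.0 - w) * high_err
```

Under these assumptions, an error that lives only in fine detail (for instance a checkerboard pattern added to the prediction) is nearly ignored at the earliest step and fully penalized at the last one, which matches the described emphasis on low-frequency structure early and high-frequency detail later.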

Paper Structure (25 sections, 10 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Visual comparisons on both real-world and synthetic low-resolution videos. Compared to the state-of-the-art VSR models (zhang2024realviformer, zhou2024upscale), our results demonstrate more natural facial details and better text structure. (Zoom in for best view)
  • Figure 2: Overview of the proposed STAR.
  • Figure 3: Motivation of LIEM. Left: schematic diagram illustrating the impact of using only global structure versus a combination of local and global structures. Right: visual comparison on real-world and synthetic videos. (Zoom in for best view)
  • Figure 4: Motivation of DF Loss. Left: PSNR curves of low- and high-frequency components relative to ground truth across diffusion steps. The low-frequency PSNR increases during the early diffusion steps, while the high-frequency PSNR rises in the later diffusion steps. Right: visual results of low- and high-frequency components at different diffusion stages. (Zoom in for best view)
  • Figure 5: Dynamic Frequency Loss. Left: curves of weighting function $c(t)$ for different $\alpha$. Right: details of DF loss.
  • ...and 10 more figures