VideoGigaGAN: Towards Detail-rich Video Super-Resolution

Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, Difan Liu

TL;DR

VideoGigaGAN is a large-scale GAN-based approach to video super-resolution (VSR) that extends the GigaGAN image upsampler with temporal modules, flow-guided feature propagation, anti-aliasing blocks, and a high-frequency shuttle to produce detail-rich, temporally consistent results, including 8× upsampling. The method addresses the trade-off between temporal consistency and per-frame quality in VSR by combining temporal alignment and high-frequency detail pathways in a single feed-forward model. Experiments on REDS4 and Vimeo-90K-T show improved perceptual quality (LPIPS) and competitive temporal consistency, while noting trade-offs against PSNR/SSIM. Limitations include very long sequences and small objects, which point to directions for future work.

Abstract

Video super-resolution (VSR) approaches have shown impressive temporal consistency in upsampled videos. However, these approaches tend to generate blurrier results than their image counterparts as they are limited in their generative capability. This raises a fundamental question: can we extend the success of a generative image upsampler to the VSR task while preserving the temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that can produce videos with high-frequency details and temporal consistency. VideoGigaGAN builds upon a large-scale image upsampler -- GigaGAN. Simply inflating GigaGAN to a video model by adding temporal modules produces severe temporal flickering. We identify several key issues and propose techniques that significantly improve the temporal consistency of upsampled videos. Our experiments show that, unlike previous VSR methods, VideoGigaGAN generates temporally consistent videos with more fine-grained appearance details. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with $8\times$ super-resolution.

Paper Structure (19 sections, 4 equations, 7 figures, 4 tables)


Figures (7)

  • Figure 1: We present VideoGigaGAN, a generative video super-resolution model that can upsample videos with high-frequency details while maintaining temporal consistency. Top: comparison of our approach with TTVSR liu2022learning and BasicVSR++ chan2022basicvsrpp. Our method produces temporally consistent videos with more fine-grained details than previous methods. Bottom: our model can produce high-quality videos with $8 \times$ super-resolution. Please see the video results on our project page: https://videogigagan.github.io/.
  • Figure 2: Limitations of previous methods. Previous VSR approaches such as BasicVSR++ chan2022basicvsrpp suffer from a lack of detail, as seen in the car example. Image GigaGAN produces sharper results with richer details, but it generates videos with temporal flickering and artifacts such as aliasing (see the building). Our VideoGigaGAN produces video results with both high-frequency details and temporal consistency, while artifacts such as aliasing are significantly mitigated.
  • Figure 3: Overview of our method for $4\times$ upsampling. Our Video Super-Resolution (VSR) model is built upon the asymmetric U-Net architecture of the image GigaGAN upsampler kang2023gigagan. To enforce temporal consistency, we first inflate the image upsampler into a video upsampler by adding temporal attention layers to the decoder blocks. We further enhance consistency by incorporating features from the flow-guided propagation module. To suppress aliasing artifacts, we use anti-aliasing blocks in the downsampling layers of the encoder. Lastly, we directly shuttle the high-frequency features via skip connections to the decoder layers to compensate for the loss of detail in the BlurPool process.
  • Figure 4: Qualitative comparison with other baselines on public datasets (REDS4 Nah2019reds4, Vimeo-90K-T xue2019toflow). We show PSNR/LPIPS below each output frame. PSNR does not align well with human perception and favors blurry results. LPIPS is a preferred metric that aligns better with human perception. Compared to previous VSR approaches, our model produces more realistic textures and more fine-grained details.
  • Figure 5: Ablation study. Starting from the inflated GigaGAN (+Temporal attention in the figure), we progressively add components to demonstrate their effectiveness. With temporal attention, local temporal consistency improves compared to using image GigaGAN to upsample each frame independently. Global temporal consistency improves with feature propagation, but aliasing still exists in areas with high-frequency details (please refer to the videos on the project page: https://videogigagan.github.io/), and the video results become blurrier. With the anti-aliasing blocks (BlurPool), aliasing is substantially reduced, but the video results become even blurrier. Finally, with the HF shuttle, we recover the per-frame quality and high-frequency details while preserving good temporal consistency.
  • ...and 2 more figures
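
The captions of Figures 3 and 5 describe two linked ideas: blurring before downsampling (BlurPool) to suppress aliasing, and shuttling the high-frequency part of the features past the blur so that detail is not lost. The following is a minimal sketch of that idea on a 1D list of feature values, not the authors' implementation. The function names, the `[1, 2, 1] / 4` binomial kernel, the edge padding, and the stride of 2 are all assumptions chosen for illustration; the actual model operates on learned multi-channel feature maps.

```python
def blur(signal):
    """Low-pass filter with an assumed binomial kernel [1, 2, 1] / 4.

    Edge values are repeated so the output has the same length as the input.
    """
    padded = [signal[0]] + list(signal) + [signal[-1]]
    return [
        (padded[i] + 2 * padded[i + 1] + padded[i + 2]) / 4
        for i in range(len(signal))
    ]


def blurpool(signal, stride=2):
    """Anti-aliased downsampling: blur first, then keep every `stride`-th sample."""
    return blur(signal)[::stride]


def split_frequencies(signal):
    """Split a signal into its blurred (low-frequency) part and the residual."""
    low = blur(signal)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# A pattern that alternates every sample: the hardest case for strided subsampling.
features = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

naive = features[::2]              # plain subsampling sees only the zeros
antialiased = blurpool(features)   # blurred subsampling stays near the local mean

# The high-frequency residual is what the blur discards. Adding it back to the
# low-frequency part recovers the original values, which is the role a skip
# connection carrying this residual would play.
low, high = split_frequencies(features)
restored = [l + h for l, h in zip(low, high)]
```

In this toy case the naive subsampling collapses the alternating pattern to a constant, while the blurred version stays close to the local average. The residual `high` carries exactly the detail that the blur removed, which is consistent with the ablation in Figure 5, where BlurPool alone makes results blurrier and the HF shuttle restores detail.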