VideoGigaGAN: Towards Detail-rich Video Super-Resolution
Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, Difan Liu
TL;DR
VideoGigaGAN is a large-scale GAN-based approach to video super-resolution (VSR). It extends the GigaGAN image upsampler with temporal modules, flow-guided feature propagation, anti-aliasing blocks, and a high-frequency shuttle to produce detailed, temporally coherent video, including at 8× upsampling. The method addresses the consistency-quality dilemma in VSR by combining temporal alignment and high-frequency detail pathways in a single feed-forward model. Experiments on REDS4 and Vimeo-90K-T show improved perceptual detail (LPIPS) and competitive temporal consistency, while noting trade-offs against PSNR/SSIM. Limitations include very long sequences and small objects, which point to directions for future work.
Abstract
Video super-resolution (VSR) approaches have shown impressive temporal consistency in upsampled videos. However, these approaches tend to generate blurrier results than their image counterparts as they are limited in their generative capability. This raises a fundamental question: can we extend the success of a generative image upsampler to the VSR task while preserving the temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that can produce videos with high-frequency details and temporal consistency. VideoGigaGAN builds upon a large-scale image upsampler -- GigaGAN. Simply inflating GigaGAN to a video model by adding temporal modules produces severe temporal flickering. We identify several key issues and propose techniques that significantly improve the temporal consistency of upsampled videos. Our experiments show that, unlike previous VSR methods, VideoGigaGAN generates temporally consistent videos with more fine-grained appearance details. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with $8\times$ super-resolution.
