InfVSR: Breaking Length Limits of Generic Video Super-Resolution

Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, Yulun Zhang

TL;DR

InfVSR tackles long-form video super-resolution by reframing VSR as autoregressive one-step diffusion (AR-OSD), which enables streaming inference over unbounded-length sequences. It adapts a pre-trained DiT into a causal backbone with a rolling KV-cache and joint visual guidance, and uses patch-wise pixel supervision together with cross-chunk distribution matching to distill diffusion into a single step per chunk. A two-stage curriculum and a new MovieLQ benchmark with semantic-level temporal metrics support training and evaluation. The method reports state-of-the-art fidelity and temporal coherence with up to 58x speed-up over prior diffusion-based methods, which makes high-quality VSR on real-world long videos more practical to deploy.

Abstract

Real-world videos often extend over thousands of frames. Existing video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor scalability, because temporal decomposition causes artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive one-step diffusion paradigm. This enables streaming inference while fully leveraging pre-trained video diffusion priors. First, we adapt the pre-trained DiT into a causal structure, maintaining both local and global coherence via a rolling KV-cache and joint visual guidance. Second, we efficiently distill the diffusion process into a single step, with patch-wise pixel supervision and cross-chunk distribution matching. Together, these designs enable efficient and scalable VSR for unbounded-length videos. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Code will be available at https://github.com/Kai-Liu001/InfVSR.
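The streaming idea in the abstract (process the video chunk by chunk, one step per chunk, with a rolling cache of past context) can be illustrated with a minimal sketch. This is not the authors' implementation: `iter_chunks`, `stream_upscale`, and `toy_one_step` are hypothetical names, the "model" is a toy stand-in that merely doubles each value, the cache holds one summary number per chunk rather than attention keys and values, and the global joint visual guidance is omitted.

```python
from collections import deque
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical signature: (current chunk, visible past context) -> (restored chunk, new context)
OneStepFn = Callable[[List[float], List[float]], Tuple[List[float], float]]


def iter_chunks(frames: Sequence[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive, non-overlapping chunks of frames."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(frames), chunk_size):
        yield list(frames[start:start + chunk_size])


def stream_upscale(frames: Sequence[float], chunk_size: int,
                   cache_len: int, one_step_fn: OneStepFn) -> List[float]:
    """Restore chunks in temporal order with a bounded window of past context."""
    # deque(maxlen=...) evicts the oldest entry on append, so the context
    # (and hence memory) stays bounded no matter how long the video is.
    cache: deque = deque(maxlen=cache_len)
    outputs: List[float] = []
    for chunk in iter_chunks(frames, chunk_size):
        restored, context = one_step_fn(chunk, list(cache))  # one call per chunk
        outputs.extend(restored)
        cache.append(context)
    return outputs


def toy_one_step(chunk: List[float], past_context: List[float]) -> Tuple[List[float], float]:
    """Toy stand-in for a one-step restorer (assumption: not a real VSR model)."""
    restored = [2.0 * x for x in chunk]
    context = sum(chunk) / len(chunk)  # placeholder for cached keys/values
    return restored, context


frames = list(range(10))
result = stream_upscale(frames, chunk_size=3, cache_len=2, one_step_fn=toy_one_step)
```

With `cache_len=2`, each call sees at most the two most recent context entries, so per-chunk cost does not grow with video length; that bounded-context property is what the sketch is meant to show.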

Paper Structure

This paper contains 12 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Speed and multi-frame comparisons. Our InfVSR can seamlessly upscale videos of unbounded length in a streaming fashion, and demonstrates the best quality and fastest speed among existing diffusion-based methods. Compared with MGLD-VSR, it is 58$\times$ faster.
  • Figure 2: Overview of the framework and training strategy of InfVSR. Our method combines intra-chunk one-step diffusion with inter-chunk autoregression (AR) for efficient and scalable VSR. AR is supported by a local KV-cache and global joint visual guidance. To enable effective and efficient training, we adopt two objectives: (1) patch-wise pixel supervision, which guides detail reconstruction with significantly reduced decoding memory through random spatial cropping; and (2) cross-chunk distribution matching, which enforces high-level consistency with a pretrained and a finetuned regularizer, following VSD, DMD, and OSEDiff.
  • Figure 3: Illustration of our patch-wise pixel supervision.
  • Figure 4: Visual comparison on SPMCS and VideoLQ.
  • Figure 5: Comparison of the temporal profiles of SOTA methods (obtained by stacking the red line across frames).
  • ...and 1 more figure
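The patch-wise pixel supervision named in the captions of Figures 2 and 3 relies on random spatial cropping so that only a small region has to be decoded. A rough sketch of the bookkeeping involved follows. The function names, the latent-to-pixel scale factor of 8, and the example sizes are all assumptions for illustration and are not taken from the paper.

```python
import random
from typing import Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def sample_aligned_crop(latent_hw: Tuple[int, int], patch: int,
                        scale: int, rng: random.Random) -> Tuple[Box, Box]:
    """Pick a random square crop in latent space and the matching pixel-space box."""
    height, width = latent_hw
    if patch <= 0 or patch > height or patch > width:
        raise ValueError("patch must fit inside the latent grid")
    top = rng.randint(0, height - patch)   # randint is inclusive on both ends
    left = rng.randint(0, width - patch)
    latent_box = (top, left, top + patch, left + patch)
    # Assumption: the decoder upsamples each latent cell to scale x scale pixels,
    # so the ground-truth patch is the latent box multiplied by scale.
    pixel_box = tuple(v * scale for v in latent_box)
    return latent_box, pixel_box


def decoded_fraction(latent_hw: Tuple[int, int], patch: int) -> float:
    """Fraction of the latent area that is decoded when only the crop is used."""
    height, width = latent_hw
    return (patch * patch) / (height * width)


rng = random.Random(0)
latent_box, pixel_box = sample_aligned_crop((90, 160), patch=30, scale=8, rng=rng)
fraction = decoded_fraction((90, 160), patch=30)
```

In this hypothetical setting a 30x30 crop of a 90x160 latent covers 1/16 of the area, which suggests why decoding only the crop before computing a pixel loss against the aligned ground-truth patch can reduce decoding memory substantially.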