SkipSR: Faster Super Resolution with Token Skipping

Rohan Choudhury, Shanchuan Lin, Jianyi Wang, Hao Chen, Qi Zhao, Feng Cheng, Lu Jiang, Kris Kitani, Laszlo A. Jeni

TL;DR

SkipSR tackles the scalability challenge of diffusion-based video SR by predicting and skipping low-detail patches from the low-resolution input, routing only the complex patches through the Transformer and combining results with fast upsampling. The method leverages a lightweight mask predictor in the latent space and RoPE-adapted position handling to maintain consistency when some patches bypass the transformer, enabling substantial wall-clock speedups without perceptual quality loss. It unifies a skip-aware diffusion pathway with training strategies including one-step distillation and adversarial post-training, and demonstrates up to 60% reductions in end-to-end latency on 720p video SR (and up to 70% diffusion-time reductions on 1080p) while matching SeedVR/SeedVR2 quality on real-world and AI-generated data. The work offers a practical approach to accelerate high-resolution video SR, making diffusion-based restoration and generation more scalable for longer sequences and higher resolutions, with clear speed-quality tradeoffs governed by a tunable threshold $\tau$.
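The routing idea in the summary above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `skip_route`, the callables `refine` and `upsample`, and the convention that a patch is refined when its score exceeds $\tau$ are all assumptions made for illustration. The RoPE-adapted position handling mentioned above is omitted here.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")  # a patch; any type works in this toy sketch


def skip_route(
    patches: Sequence[P],
    scores: Sequence[float],
    tau: float,
    refine: Callable[[List[P]], List[P]],
    upsample: Callable[[P], P],
) -> List[P]:
    """Send high-score patches through `refine`, the rest through `upsample`.

    Assumption: score > tau means "complex, needs refinement". The source only
    says that tau governs the speed-quality tradeoff.
    """
    if len(patches) != len(scores):
        raise ValueError("patches and scores must have the same length")

    # Indices of patches routed through the expensive path.
    keep = [i for i, s in enumerate(scores) if s > tau]

    # One batched call on the kept patches only; skipped patches never enter it.
    refined = refine([patches[i] for i in keep])
    if len(refined) != len(keep):
        raise ValueError("refine must return one output per input patch")
    refined_by_index = dict(zip(keep, refined))

    # Compose: refined output where available, cheap upsampling elsewhere.
    return [
        refined_by_index[i] if i in refined_by_index else upsample(p)
        for i, p in enumerate(patches)
    ]
```

In this sketch, raising `tau` sends fewer patches to `refine`, which is where the saving would come from; lowering it approaches refining every patch.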

Abstract

Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/

Paper Structure

This paper contains 27 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Patch Skipping. Standard diffusion SR models refine the entire input, while SkipSR identifies and upscales only the patches that need refinement. This significantly reduces computation with no loss in perceptual quality.
  • Figure 2: Oracle Mask Computation. We identify low-detail video regions by comparing the original high-resolution input to a spatially downsampled, then upsampled version. This yields an attention mask (shown in dark) that excludes these regions from refinement.
  • Figure 3: SkipSR Overview. We take low-resolution videos as input, project them into latent space, then compute the complexity mask. Simple patches skip computation and are routed around the transformer, then composed with the refined output.
  • Figure 4: Visual Comparison. In the above two examples, the ground truth is on top, followed by the predicted mask and our output. We produce perceptually indistinguishable results while refining only a small subset of the input video. Additional results are on our project page: https://clamsoup97.github.io/anonymous-projects/skipsr/
  • ...and 4 more figures
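
The oracle mask described in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's procedure: average pooling for downsampling, nearest-neighbour upsampling, mean absolute error as the comparison, and the parameters `s`, `patch`, and `eps` are all hypothetical choices. Frames are plain nested lists of grayscale values.

```python
from typing import List

Frame = List[List[float]]


def down_up(frame: Frame, s: int) -> Frame:
    """Average-pool by factor s, then nearest-neighbour upsample to the input size."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, s):
        for bx in range(0, w, s):
            ys = range(by, min(by + s, h))
            xs = range(bx, min(bx + s, w))
            block = [frame[y][x] for y in ys for x in xs]
            mean = sum(block) / len(block)
            # Writing the block mean back is equivalent to pooling then upsampling.
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def oracle_skip_mask(frame: Frame, s: int, patch: int, eps: float) -> List[List[bool]]:
    """True marks a patch that the down-then-up round trip reproduces within eps."""
    recon = down_up(frame, s)
    h, w = len(frame), len(frame[0])
    mask = []
    for py in range(0, h, patch):
        row = []
        for px in range(0, w, patch):
            errs = [
                abs(frame[y][x] - recon[y][x])
                for y in range(py, min(py + patch, h))
                for x in range(px, min(px + patch, w))
            ]
            # Low error means little detail was lost, so the patch can be skipped.
            row.append(sum(errs) / len(errs) < eps)
        mask.append(row)
    return mask
```

A flat region survives the round trip unchanged and is marked skippable, while a finely textured region loses its detail and is kept for refinement.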