InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution

Jintong Hu, Bin Chen, Zhenyu Hu, Jiayue Liu, Guo Wang, Lu Qi

Abstract

Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K$\times$2K resolution in under one minute with only 7 GB of memory usage, substantially reducing the computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality with significantly smoother temporal transitions.
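
To make the flow-guided temporal regularization (ingredient 2) concrete, the following is a minimal PyTorch sketch of a flow-aligned consistency loss between consecutive super-resolved frames. It is an illustration under stated assumptions, not the paper's implementation: the names `flow_warp` and `temporal_loss` are hypothetical, the flow is taken to be a per-pixel backward displacement field in pixels, and occlusion handling and loss weighting are simplifications.

```python
import torch
import torch.nn.functional as F

def flow_warp(frame, flow):
    """Backward-warp `frame` (B, C, H, W) with optical flow (B, 2, H, W).

    flow[:, 0] holds horizontal and flow[:, 1] vertical displacements in
    pixels, mapping each target pixel to its source location.
    """
    b, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1], the convention grid_sample expects.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(frame, grid, align_corners=True, padding_mode="border")

def temporal_loss(sr_prev, sr_curr, flow_curr_to_prev, occlusion_mask=None):
    """Penalize flow-aligned differences between consecutive SR frames."""
    warped_prev = flow_warp(sr_prev, flow_curr_to_prev)
    diff = torch.abs(sr_curr - warped_prev)
    if occlusion_mask is not None:  # ignore occluded/unreliable pixels
        diff = diff * occlusion_mask
    return diff.mean()
```

Warping the previous SR frame into the current frame's coordinates, rather than comparing raw frames, lets such a loss penalize flicker without penalizing genuine motion.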

Paper Structure

This paper contains 18 sections, 10 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Visual and quantitative comparison of video super-resolution methods on real-world content. Left: super-resolution results showing the low-quality input and the outputs of different methods, with zoomed-in patch comparisons. Right: comparison of computational efficiency (latency, parameters, and VRAM) and visual quality metrics. Our method achieves a strong trade-off between efficiency and perceptual quality compared with RealViformer [zhang2024realviformer] and SeedVR2 [wang2025seedvr2].
  • Figure 2: The main pipeline of GaussianSR. GaussianSR begins with an encoder that extracts feature representations from the input image, followed by Selective Gaussian Splatting, which assigns a learnable Gaussian kernel to each pixel, converting discrete feature points into Gaussian fields. Features at an arbitrary query point $x_{q}$ in the plane are computed from the overlapping Gaussian functions, whose influence is modulated by spatial location. Finally, these continuous-domain features are rendered into a high-resolution space and refined by the decoder to reconstruct the RGB output at the specified query coordinates.
  • Figure 3: Recurrent VSR Training with Progressive Super-Resolution and Temporal Loss. Center frames are progressively super-resolved within a 5-frame sliding window aligned by optical flow, and each SR frame replaces its LR counterpart so the buffer is continuously updated. Cyclic inference then produces second-order SR results, and a temporal loss between adjacent SR frames enforces consistency during recurrent training (see the training-loop sketch after this figure list).
  • Figure 4: Qualitative comparisons on real-world videos. Our method is capable of generating more realistic and fine-grained details. Compared to existing methods, it notably excels in restoration quality: for the stone-structured building (first row), our method produces more delicate and authentic stone textures; for the red bricks and leaves (second row), our method generates well-aligned and structurally regular patterns that better conform to human visual preferences. (Zoom-in for best view).
  • Figure 5: Temporal Profiles of Video Reconstruction: Comparison of Our Method with Baselines. Our method yields smooth profiles with well-preserved structural textures, especially straight architectural lines. In contrast, competing methods either remove these features or introduce geometric distortions, producing jagged patterns. This highlights our method's advantage in maintaining both temporal consistency and geometric fidelity. (Zoom-in for best view).
  • ...and 2 more figures
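
The recurrent buffer update described in Figure 3 can be sketched as a short training-loop fragment. Everything here is an assumption-laden illustration rather than the authors' code: `model` is assumed to map a 5-frame window to an SR estimate of its center frame, `estimate_flow` is a stand-in for any optical-flow network, `temporal_loss` is the flow-aligned loss sketched earlier, and resizing the SR frame back to the buffer resolution is our simplification of "SR frames replacing LR frames".

```python
import torch
import torch.nn.functional as F

WINDOW = 5            # sliding-window length from Figure 3
CENTER = WINDOW // 2  # index of the frame being super-resolved

def recurrent_vsr_step(model, estimate_flow, temporal_loss, lr_frames, lambda_t=1.0):
    """One recurrent pass over a clip: super-resolve each center frame,
    feed the result back into the frame buffer, and accumulate the
    temporal loss between adjacent SR outputs."""
    h, w = lr_frames[0].shape[-2:]
    buffer = list(lr_frames)  # starts as LR frames, updated progressively
    sr_frames = []
    t_loss = lr_frames[0].new_zeros(())
    for t in range(len(buffer) - WINDOW + 1):
        window = torch.stack(buffer[t : t + WINDOW], dim=1)  # (B, T, C, H, W)
        sr = model(window)  # SR estimate of the window's center frame
        # Progressive buffer update: later windows see already-restored
        # neighbors. We resize back to the buffer resolution and detach so
        # the unrolled graph stays bounded (a simplifying assumption).
        buffer[t + CENTER] = F.interpolate(
            sr, size=(h, w), mode="bilinear", align_corners=False
        ).detach()
        if sr_frames:
            # Flow from the current SR frame to the previous one drives
            # the flow-guided consistency term.
            flow = estimate_flow(sr, sr_frames[-1])
            t_loss = t_loss + temporal_loss(sr_frames[-1], sr, flow)
        sr_frames.append(sr)
    return sr_frames, lambda_t * t_loss
```

Only frames that receive a full window are produced here; boundary handling, the cyclic second-order inference pass, and the exact loss weights would follow the paper.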