InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior

Weimin Bai, Suzhe Xu, Yiwei Ren, Jinhua Hao, Ming Sun, Wenzheng Chen, He Sun

TL;DR

InstantViR addresses the challenge of real-time video inverse problems by distilling a powerful, pre-trained video diffusion prior into a one-shot, causal autoregressive solver. It formulates amortized inference in latent space with a data-fidelity term and a prior-distillation term against a frozen diffusion teacher, eliminating the need for paired training data. A block-wise, causally-attentive architecture with KV caching and a latent-space adaptation via LeanVAE enables high-speed streaming (over 35 FPS at 832×480 and up to 100× faster than iterative solvers) while preserving diffusion-level temporal coherence. The framework supports random inpainting, Gaussian deblurring, 4× super-resolution, and text-guided reconstruction/editing, making diffusion-based video restoration practical for live, editable, and streaming applications.
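The block-wise causal processing with KV caching mentioned above can be illustrated with a toy sketch. This is not the InstantViR architecture: it assumes a single attention head, random projection matrices, and hypothetical helper names (`stream_blocks`, `block_causal_reference`). It only shows why caching keys and values lets a stream be processed one block at a time while producing the same result as attention over the whole sequence under a block-causal mask.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stream_blocks(tokens, block, wq, wk, wv):
    """Process tokens block by block, reusing cached keys/values of earlier blocks."""
    k_cache = np.zeros((0, wk.shape[1]))
    v_cache = np.zeros((0, wv.shape[1]))
    outputs = []
    for start in range(0, len(tokens), block):
        x = tokens[start:start + block]
        # Only the new block is projected; past keys/values come from the cache.
        k_cache = np.concatenate([k_cache, x @ wk])
        v_cache = np.concatenate([v_cache, x @ wv])
        q = x @ wq
        scores = q @ k_cache.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v_cache)
    return np.concatenate(outputs)

def block_causal_reference(tokens, block, wq, wk, wv):
    """Same computation without a cache: full attention with a block-causal mask."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    block_id = np.arange(len(tokens)) // block
    # A token may attend to its own block and to all earlier blocks.
    allowed = block_id[:, None] >= block_id[None, :]
    scores = np.where(allowed, scores, -np.inf)
    return softmax(scores) @ v

# Toy stream: 12 tokens of dimension 8, processed in blocks of 4 (assumed sizes).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((12, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))

streamed = stream_blocks(tokens, 4, wq, wk, wv)
reference = block_causal_reference(tokens, 4, wq, wk, wv)
print(np.allclose(streamed, reference))
```

In this toy setting the cached path projects each block only once, whereas a cache-free streaming loop would have to recompute keys and values for all past tokens at every step.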

Abstract

Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers, which leads to temporal artifacts, or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it requires only the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring, and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100× speedup over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
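To make the role of the known degradation operators concrete, here is a minimal NumPy sketch of two of the degradations named above (random inpainting and 4× super-resolution) together with a data-fidelity residual. The mean-squared form of the loss, the average-pooling downsampler, the pixel-space setting, and all function names are assumptions for illustration; the abstract does not specify them. The toy clip is used only to synthesize measurements.

```python
import numpy as np

def random_inpainting_mask(shape, keep_ratio, rng):
    """Binary mask: 1 marks an observed pixel, 0 a missing one."""
    return (rng.random(shape) < keep_ratio).astype(np.float32)

def downsample_4x(video):
    """4x spatial average pooling of a (T, H, W) clip; H and W must be divisible by 4."""
    t, h, w = video.shape
    return video.reshape(t, h // 4, 4, w // 4, 4).mean(axis=(2, 4))

def data_fidelity(x_hat, y, operator):
    """Mean squared residual between the re-degraded estimate and the measurement."""
    return float(np.mean((operator(x_hat) - y) ** 2))

rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16)).astype(np.float32)  # toy clip: 8 frames of 16x16

# Random inpainting: keep roughly 50% of the pixels (assumed ratio).
mask = random_inpainting_mask(clip.shape, keep_ratio=0.5, rng=rng)

def inpaint(x):
    """Apply the fixed inpainting mask to a clip."""
    return mask * x

y_inpaint = inpaint(clip)
y_sr = downsample_4x(clip)

# An estimate consistent with the measurement has zero residual;
# an all-zero estimate does not.
loss_true_inpaint = data_fidelity(clip, y_inpaint, inpaint)
loss_zero_inpaint = data_fidelity(np.zeros_like(clip), y_inpaint, inpaint)
loss_true_sr = data_fidelity(clip, y_sr, downsample_4x)
```

A residual of this kind only constrains the estimate where the operator preserves information (observed pixels, low-frequency content), which is why the framework pairs it with a prior-distillation term from the frozen teacher.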

Paper Structure

This paper contains 34 sections, 10 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: We introduce InstantViR, a real-time video inverse problem solver that drastically outperforms slow sampling-based methods in both speed and quality. Bottom-right: At 832$\times$480 resolution, our amortized framework is over 100$\times$ faster than sampling-based baselines like SVI [kwon2025solving], achieving over 35 FPS with excellent quality. Left and Bottom-left: Qualitative examples demonstrate versatile, high-fidelity reconstruction for inpainting and deblurring, along with optional text-guided control (e.g., "pink lips", "light-blue collar").
  • Figure 1: Video Inpainting qualitative comparison. Each row shows a complete sequence reconstructed by a specific method. InstantViR (both WanVAE Ours and LeanVAE Ours$^\dag$ variants) produces coherent content for every frame while requiring only a single feed-forward pass.
  • Figure 2: Overview of the InstantViR framework. (Top) Training: We train a single-step solver $q_\phi$ using only degraded measurements $\boldsymbol{y}$. The solver is optimized with two objectives: a data fidelity loss (ensuring the reconstruction matches the measurement $\boldsymbol{y}$) and a prior distillation loss (using a frozen video diffusion prior $s_\theta$ [wan2.1] to enforce temporal consistency and realism). (Bottom) Inference: The trained solver $q_\phi$ operates as a fast, feed-forward network, processing the video in a causal, block-wise, and autoregressive manner. This enables real-time, single-step streaming reconstruction with optional text guidance.
  • Figure 2: Video Deblurring qualitative comparison. Rows correspond to different methods; columns show consecutive frames covering the entire clip. InstantViR (both WanVAE Ours and LeanVAE Ours$^\dag$ variants) restores fine structures consistently across time, outperforming slower diffusion-based baselines.
  • Figure 3: Qualitative comparison for video random inpainting. (Top) On the Open-Sora dataset [lin2024open], our model reconstructs a high-fidelity and temporally consistent video from a 50% masked measurement. The zoomed-in face sequence demonstrates the stability and fine detail of our single-step result. (Bottom) We demonstrate strong zero-shot generalization on the REDS dataset [zhang2018unreasonable]: our method generates a sharp, coherent video that is perceptually close to the ground truth.
  • ...and 5 more figures