InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Weimin Bai, Suzhe Xu, Yiwei Ren, Jinhua Hao, Ming Sun, Wenzheng Chen, He Sun
TL;DR
InstantViR solves video inverse problems in real time by distilling a powerful, pre-trained video diffusion prior into a one-shot, causal autoregressive solver. It formulates amortized inference in latent space with a data-fidelity term and a prior-distillation term against a frozen diffusion teacher, eliminating the need for paired training data. A block-wise, causally-attentive architecture with KV caching, together with a latent-space adaptation via LeanVAE, enables high-speed streaming (over 35 FPS at 832×480, up to 100× faster than iterative solvers) while preserving diffusion-level temporal coherence. The framework supports random inpainting, Gaussian deblurring, 4× super-resolution, and text-guided reconstruction/editing, making diffusion-based video restoration practical for live, editable, and streaming applications.
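To illustrate what block-wise causal attention means in practice, the sketch below builds a frame-level attention mask in NumPy. It is a minimal illustration of the common construction (frames attend bidirectionally within their own block and causally to all earlier blocks), not the paper's implementation; the function name `block_causal_mask`, the block size, and the frame-level granularity are assumptions made for this example.

```python
import numpy as np

def block_causal_mask(num_frames: int, block_size: int) -> np.ndarray:
    """Return a boolean mask; entry [q, k] is True if query frame q may attend to key frame k.

    Hypothetical helper for illustration: frames in the same block see each
    other, and every frame sees all frames from earlier blocks.
    """
    block_id = np.arange(num_frames) // block_size  # block index of each frame
    return block_id[:, None] >= block_id[None, :]   # key block must not come after query block

# Example: 6 frames grouped into blocks of 2 frames each (illustrative sizes).
mask = block_causal_mask(num_frames=6, block_size=2)
print(mask.astype(int))
```

Under such a mask, the keys and values of a finished block never depend on later frames, which is the property that lets a KV cache reuse them when the next block arrives during streaming.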
Abstract
Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers, which leads to temporal artifacts, or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it requires only the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via a teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring, and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving speedups of up to 100× over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
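As a concrete picture of the "known degradation operators" for the three tasks, the following NumPy sketch applies random inpainting, Gaussian blur, and 4× downsampling to a video tensor of shape (frames, height, width, channels). Everything here is an assumption made for illustration: the function names, the keep probability, the blur width, and the choice of average pooling for downsampling are not taken from the paper, which may define its operators differently.

```python
import numpy as np

def random_inpainting(video, keep_prob, rng):
    """Zero out pixels at random; the mask is shared across channels (assumed)."""
    t, h, w, _ = video.shape
    mask = rng.random((t, h, w, 1)) < keep_prob
    return video * mask, mask

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(video, sigma=2.0, radius=4):
    """Separable per-frame Gaussian blur along height, then width (edge padding)."""
    k = gaussian_kernel_1d(sigma, radius)
    out = video
    for axis in (1, 2):  # axis 1 is height, axis 2 is width
        pad = [(0, 0)] * 4
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out

def downsample(video, factor=4):
    """Average-pool each frame by `factor`; height and width must be divisible by it."""
    t, h, w, c = video.shape
    blocks = video.reshape(t, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))

# Tiny illustrative video: 2 frames of 8 x 12 RGB values in [0, 1).
rng = np.random.default_rng(0)
video = rng.random((2, 8, 12, 3))
masked, mask = random_inpainting(video, keep_prob=0.5, rng=rng)
blurred = gaussian_blur(video)
low_res = downsample(video, factor=4)
```

Because each operator is an explicit function of the clean video, degraded inputs can be synthesized on the fly, which is what makes training without externally paired data possible in principle.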
