VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models
Taesung Kwon, Jong Chul Ye
TL;DR
VISION-XL tackles high-definition (HD) video inverse problems with latent diffusion priors. Its pipeline relies on two components: pseudo-batch consistent sampling, which processes multi-frame data with the memory footprint of a single frame, and pseudo-batch inversion, which seeds the sampler with informative latents derived from the measurements. The method combines frame-wise inverse encoding, parallel latent denoising, l-step data-consistency optimization, scheduled low-pass filtering, and re-denoising. It achieves state-of-the-art reconstruction across deblurring, super-resolution, and inpainting tasks, supports landscape, vertical, and square aspect ratios, and delivers HD outputs in under 6 seconds per frame on a single NVIDIA 4090 GPU with Stable Diffusion XL. Ablation studies show that informative initialization, conjugate gradient (CG) iterations, and frequency-domain filtering all matter for temporal coherence and artifact suppression, with significant gains in FVD and PSNR. The approach also extends to blind video inverse problems and is accompanied by visualizations and supplementary material.
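To make the l-step data-consistency optimization with CG iterations more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear measurement operator and a least-squares objective, and runs at most `l` conjugate gradient steps on the normal equations starting from a current estimate. The names `cg_data_consistency`, `A`, `AT`, and the toy 1D averaging operator are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(u * v for u, v in zip(a, b))


def cg_data_consistency(
    A: Callable[[Vec], Vec],
    AT: Callable[[Vec], Vec],
    y: Vec,
    x0: Vec,
    l: int = 5,
    tol: float = 1e-12,
) -> Vec:
    """Run at most l CG steps on A^T A x = A^T y, starting from x0."""
    x = list(x0)
    # Initial residual of the normal equations: r = A^T y - A^T A x0.
    r = [b - c for b, c in zip(AT(y), AT(A(x)))]
    p = list(r)
    rs = dot(r, r)
    for _ in range(l):
        if rs < tol:  # already consistent with the measurement
            break
        Ap = AT(A(p))
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Toy operator (assumption): 2x average downsampling of a length-4 signal.
def A(x: Vec) -> Vec:
    return [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]


# Adjoint of the toy operator above.
def AT(y: Vec) -> Vec:
    return [y[0] / 2, y[0] / 2, y[1] / 2, y[1] / 2]


y = [1.0, 3.0]              # measurement
x0 = [0.0, 2.0, 2.0, 2.0]   # stand-in for a denoised estimate
x_hat = cg_data_consistency(A, AT, y, x0, l=5)
```

In this toy case the update only changes the part of the estimate that the measurement constrains: the first pair of samples already averages to the measured value and is left untouched, while the second pair is shifted to match it.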
Abstract
In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advancements in spatio-temporal optimization for video inverse problems using image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy, allowing efficient operation on a single GPU. Additionally, to improve temporal consistency, we present pseudo-batch inversion, an initialization technique that incorporates informative latents from the measurement. By integrating with Stable Diffusion XL (SDXL), our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280×720) in under 6 seconds per frame on a single NVIDIA 4090 GPU.
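As an illustration of what a combined spatio-temporal degradation can look like, the sketch below composes frame averaging with a simple spatial degradation. It is a toy forward model under stated assumptions, not the operators used in the paper: the uniform temporal window, the clamped boundary handling, the average-pool downsampling, and the names `frame_average`, `downsample`, and `degrade` are all hypothetical.

```python
from typing import List

Frame = List[List[float]]


def frame_average(video: List[Frame], width: int = 3) -> List[Frame]:
    """Temporal blur: mean over `width` neighbouring frames (odd width, clamped edges)."""
    T = len(video)
    H, W = len(video[0]), len(video[0][0])
    half = width // 2
    out = []
    for t in range(T):
        # Indices of the temporal window, clamped to the valid frame range.
        idx = [min(max(t + k, 0), T - 1) for k in range(-half, half + 1)]
        out.append(
            [[sum(video[i][r][c] for i in idx) / len(idx) for c in range(W)]
             for r in range(H)]
        )
    return out


def downsample(frame: Frame, s: int = 2) -> Frame:
    """Spatial degradation: s x s average pooling."""
    H, W = len(frame), len(frame[0])
    return [
        [sum(frame[r * s + i][c * s + j] for i in range(s) for j in range(s)) / (s * s)
         for c in range(W // s)]
        for r in range(H // s)
    ]


def degrade(video: List[Frame], width: int = 3, s: int = 2) -> List[Frame]:
    """Frame averaging followed by per-frame spatial downsampling."""
    return [downsample(f, s) for f in frame_average(video, width)]


# Three constant 2x2 frames with intensities 0, 3, and 6.
video = [[[v, v], [v, v]] for v in (0.0, 3.0, 6.0)]
measurements = degrade(video, width=3, s=2)
```

Each measured frame here mixes information from its temporal neighbours and loses spatial resolution, which is why a solver has to reason over the whole clip rather than over frames in isolation.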
