VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models

Taesung Kwon, Jong Chul Ye

TL;DR

VISION-XL tackles high-definition (HD) video inverse problems with latent image diffusion priors. Pseudo-batch consistent sampling processes multi-frame data with the memory footprint of a single frame, and pseudo-batch inversion seeds the sampler with informative latents derived from the measurements. The method combines frame-wise inverse encoding, parallel latent denoising, l-step data-consistency optimization, scheduled low-pass filtering, and re-noising. It achieves state-of-the-art reconstruction on deblurring, super-resolution, and inpainting tasks, supports landscape, vertical, and square aspect ratios, and delivers HD outputs in under 6 seconds per frame on a single NVIDIA 4090 GPU with Stable Diffusion XL. Ablation studies show that informative initialization, CG iterations, and frequency-domain filtering each matter for temporal coherence and artifact suppression, with gains in both FVD and PSNR. The approach also extends to blind video inverse problems, and the paper includes additional visualizations and supplementary material.

Abstract

In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advancements in spatio-temporal optimization for video inverse problems using image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy, allowing efficient operation on a single GPU. Additionally, to improve temporal consistency, we present pseudo-batch inversion, an initialization technique that incorporates informative latents from the measurement. By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280x720) in under 6 seconds per frame on a single NVIDIA 4090 GPU.
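The pseudo-batch idea can be illustrated with a minimal sketch: the multi-frame latent is split into frames, each frame is denoised on its own, and the results are merged. The sketch assumes a variance-preserving noise schedule and the standard noise-prediction form of Tweedie's formula. The names `tweedie_denoise`, `pseudo_batch_denoise`, and `predict_noise` are hypothetical and not taken from the paper; the loop runs sequentially for clarity, and the oracle noise predictor stands in for the diffusion model.

```python
import math
import random

def tweedie_denoise(z_t, eps_hat, alpha_bar):
    """Posterior-mean estimate of the clean latent (assumed VP schedule):
    z0_hat = (z_t - sqrt(1 - alpha_bar) * eps_hat) / sqrt(alpha_bar)."""
    s, c = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(z - c * e) / s for z, e in zip(z_t, eps_hat)]

def pseudo_batch_denoise(latents, predict_noise, alpha_bar):
    """Split the multi-frame latent into frames, denoise each frame on its
    own, and merge the results, so only one frame is processed per call."""
    denoised = []
    for frame in latents:
        eps_hat = predict_noise(frame)  # hypothetical per-frame model call
        denoised.append(tweedie_denoise(frame, eps_hat, alpha_bar))
    return denoised

# Toy data: 3 "frames", each a latent of 4 values.
rng = random.Random(0)
alpha_bar = 0.5
clean = [[0.1 * (f + i) for i in range(4)] for f in range(3)]
noise = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
noisy = [
    [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
     for x, e in zip(clean_frame, noise_frame)]
    for clean_frame, noise_frame in zip(clean, noise)
]

# Oracle predictor that returns the true noise, used only to check the formula.
oracle = iter(noise)
restored = pseudo_batch_denoise(noisy, lambda frame: next(oracle), alpha_bar)
```

With a perfect noise prediction the formula recovers the clean latents up to floating-point error. A real model only approximates the noise, which is why the pipeline follows denoising with a data-consistency step.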

Paper Structure

This paper contains 12 sections, 9 equations, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: Representative video reconstruction by VISION-XL: SR+ (frame averaging with $\times$4 super-resolution, top), Deblur+ (frame averaging with deblurring, $\sigma$=3.0, bottom-left), and Inpaint+ (frame averaging with 50% random inpainting, bottom-right).
  • Figure 2: Illustration of VISION-XL sampling at timestep $t$: ${\boldsymbol{z}}_t$ is split into individual frames and denoised in parallel using Tweedie’s formula. The denoised latents $\hat{{\boldsymbol{z}}}_t$ are then merged and decoded. The decoded batch $\hat{{\boldsymbol{X}}}_t$ is optimized to enforce data consistency, followed by low-pass filtered encoding and re-noising to obtain ${\boldsymbol{z}}_{t-1}$.
  • Figure 3: Geometric illustration of the sampling path evolution. (Step 1) Initialize latent ${\boldsymbol{z}}_{\tau}$. (Step 2) Project onto $\mathcal{M}_0$ via pseudo-batch sampling and decode to pixel space. (Step 3) Optimize for measurement consistency ${\boldsymbol{Y}}={\mathcal{A}}({\boldsymbol{X}})$. (Step 4) Apply a scheduled low-pass filter and encode back to latent space. (Step 5) Renoise to $\mathcal{M}_{\tau-1}$.
  • Figure 4: Qualitative evaluation on spatio-temporal inverse problems using the DAVIS and Pexels datasets with multiple aspect ratios. Notably, ADMM-TV fails to remove ghosting artifacts caused by temporal degradation (red arrows), while SVI produces excessive intensity fluctuations (red box) or blurred restorations (green and blue boxes).
  • Figure 5: Qualitative evaluation of SR ($\times$4) performance across multiple aspect ratios (landscape, vertical). DiffIR2VR often produces unwanted artifacts in the background (red and blue boxes), while SVI inaccurately restores intensity (red box), leading to frame-wise fluctuations.
  • ...and 3 more figures
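
The measurement-consistency step (Step 3 in Figure 3) can be sketched with a few conjugate-gradient (CG) iterations on the normal equations $A^\top A x = A^\top y$, warm-started from the denoised estimate. This is a toy illustration under stated assumptions, not the paper's implementation: each frame is reduced to a single pixel value, and the operator `apply_A` (a circular two-frame temporal average) is a hypothetical stand-in for the frame-averaging degradation. All function names are illustrative.

```python
import math

def apply_A(x):
    """Toy temporal degradation: average each frame with its successor (circular)."""
    n = len(x)
    return [0.5 * (x[i] + x[(i + 1) % n]) for i in range(n)]

def apply_AT(y):
    """Adjoint of apply_A: each frame collects from itself and its predecessor."""
    n = len(y)
    return [0.5 * (y[i] + y[(i - 1) % n]) for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual_norm(x, y):
    """Euclidean norm of the measurement residual y - A(x)."""
    return math.sqrt(sum((yi - ai) ** 2 for yi, ai in zip(y, apply_A(x))))

def cg_data_consistency(x0, y, steps, tol=1e-12):
    """Run at most `steps` CG iterations on A^T A x = A^T y, starting from x0."""
    x = list(x0)
    r = apply_AT([yi - ai for yi, ai in zip(y, apply_A(x))])
    p = list(r)
    rs = dot(r, r)
    for _ in range(steps):
        if rs < tol:  # already consistent with the measurement
            break
        q = apply_AT(apply_A(p))
        alpha = rs / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Six single-pixel "frames", their temporally averaged measurement,
# and a flat initial estimate standing in for the denoised batch.
truth = [1.0, 2.0, 4.0, 3.0, 5.0, 2.0]
y = apply_A(truth)
x0 = [3.0] * 6
x = cg_data_consistency(x0, y, steps=3)
```

Limiting the number of CG steps keeps the update close to the denoised estimate while still reducing the measurement residual. The operator here is rank-deficient, so the measurement alone does not determine the frames uniquely and the starting estimate influences the result.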