Solving Video Inverse Problems Using Image Diffusion Models

Taesung Kwon, Jong Chul Ye

TL;DR

This work tackles video inverse problems under spatio-temporal degradation using only pre-trained image diffusion models. It treats the temporal axis of a video as the batch dimension and combines batch-consistent sampling with the conjugate-gradient (Krylov-subspace) optimization of the decomposed diffusion sampler (DDS) to refine Tweedie-denoised batches across space and time, without training a video diffusion model. The method achieves state-of-the-art reconstructions on temporal and spatio-temporal degradations while being VRAM-efficient and faster than prior approaches; it remains usable at a low number of function evaluations (NFEs) and approaches real-time speed for short sequences. It also extends to blind and other restoration settings, which makes it practical for video restoration when training data or compute is limited.

Abstract

Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling strategy that encourages consistency across batches by synchronizing the stochastic noise components in image diffusion models. Our approach synergistically combines batch-consistent sampling with simultaneous optimization of denoised spatio-temporal batches at each reverse diffusion step, resulting in a novel and efficient diffusion sampling strategy for video inverse problems. Experimental results demonstrate that our method effectively addresses various spatio-temporal degradations in video inverse problems, achieving state-of-the-art reconstructions. Project page: https://svi-diffusion.github.io/
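
As a rough illustration of the batch-consistent sampling idea described in the abstract, the sketch below treats the time axis of a video tensor as the batch axis and contrasts independent per-frame noise with one noise image shared by every frame. The function names, tensor sizes, and the use of NumPy are illustrative assumptions and are not taken from the authors' code.

```python
import numpy as np

def batch_independent_noise(num_frames, frame_shape, rng):
    """Baseline: every frame in the batch gets its own Gaussian noise."""
    return rng.standard_normal((num_frames, *frame_shape))

def batch_consistent_noise(num_frames, frame_shape, rng):
    """Synchronized noise: draw one noise image and share it across all frames."""
    eps = rng.standard_normal(frame_shape)  # single shared sample
    return np.broadcast_to(eps, (num_frames, *frame_shape)).copy()

rng = np.random.default_rng(0)

# Hypothetical video of T=8 frames, laid out as (T, C, H, W).
# The time axis T plays the role of the image model's batch axis.
video = np.zeros((8, 3, 16, 16))

independent = batch_independent_noise(video.shape[0], video.shape[1:], rng)
shared = batch_consistent_noise(video.shape[0], video.shape[1:], rng)

# With shared noise, all frames see the same stochastic component.
print(np.array_equal(shared[0], shared[7]))
```

With shared noise alone, every frame would follow the same sampling path; the frame-specific information has to come from the data-consistency step applied to the denoised batch.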

Paper Structure (24 sections, 22 equations, 18 figures, 8 tables, 2 algorithms)

Figures (18)

  • Figure 1: Representative video reconstruction results for (a) Temporal degradation, (b) Temporal degradation + Deblurring combination, (c) Temporal degradation + Super-resolution combination, and (d) Temporal degradation + Inpainting combination.
  • Figure 2: Geometric illustration of the sampling path evolution. (a) Batch-independent sampling produces independent frames. (b) Batch-consistent sampling produces identical frames. (c) Batch-consistent sampling combined with frame-dependent perturbation through multi-step CG generates distinct frames that satisfy spatio-temporal data consistency.
  • Figure 3: Sampling process in our video inverse problem solver. ${\bm{X}}_t$ is denoised to produce $\hat{{\bm{X}}}^b_t$ using the 2D Tweedie formula, then reshaped into a video tensor. Multi-step CG in the video space, satisfying Eq. (\ref{eq:CG}), is applied to obtain $\bar{{\bm{X}}}_t$, which is then reshaped back into an image batch. Finally, ${\bm{X}}_{t-1}$ is sampled by adding noise $\hat{\boldsymbol{\mathcal{E}}}^b_t$.
  • Figure 4: Qualitative evaluation of temporal degradation tasks. 1$^{\text{st}}$ row: temporal ${\mathcal{A}}$ with a uniform PSF of kernel width $k = 7$. 2$^{\text{nd}}$ row: temporal ${\mathcal{A}}$ with a Gaussian PSF with $\sigma = 1$. Red and blue boxes indicate the enlarged views of the previous and next frames, respectively.
  • Figure 5: Qualitative evaluation of spatio-temporal degradation tasks. Each spatio-temporal degradation is combined with various spatial degradation tasks. 1$^{\text{st}}$ row: Deblurring ($\sigma$ = 2.0). 2$^{\text{nd}}$ row: SR ($\times$ 4). 3$^{\text{rd}}$ row: Inpainting ($r$ = 0.5).
  • ...and 13 more figures
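
The multi-step CG refinement mentioned in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the temporal degradation is modeled as a uniform PSF of width 7 with circular boundary handling (which makes the operator self-adjoint), the "denoised batch" is a stand-in made of identical frames, and `temporal_blur` and `cg_refine` are hypothetical helper names.

```python
import numpy as np

def temporal_blur(video, width=7):
    """Uniform temporal PSF: average `width` neighbouring frames (circular boundary).

    Assumption: with a symmetric kernel and circular boundary this operator
    equals its own adjoint, so it is reused below as A^T.
    """
    half = width // 2
    return sum(np.roll(video, s, axis=0) for s in range(-half, half + 1)) / width

def cg_refine(x0, y, forward, adjoint, num_steps=5):
    """Run a few conjugate-gradient steps on A^T A x = A^T y, starting from x0."""
    x = x0.copy()
    r = adjoint(y) - adjoint(forward(x))  # residual of the normal equations
    p = r.copy()
    rs_old = np.vdot(r, r)
    for _ in range(num_steps):
        if rs_old < 1e-20:  # already consistent with the measurement
            break
        Ap = adjoint(forward(p))
        alpha = rs_old / np.vdot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = np.vdot(r, r)
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
x_true = rng.standard_normal((16, 1, 8, 8))  # hypothetical clean video (T, C, H, W)
y = temporal_blur(x_true)                    # temporally degraded measurement

# Stand-in for a batch-consistent denoised estimate: every frame is identical.
x0 = np.repeat(x_true.mean(axis=0, keepdims=True), 16, axis=0)

refined = cg_refine(x0, y, temporal_blur, temporal_blur, num_steps=5)

res_before = np.linalg.norm(temporal_blur(x0) - y)
res_after = np.linalg.norm(temporal_blur(refined) - y)
print(res_before, res_after)
```

In this toy setting the CG update is what makes the frames differ from one another: the starting batch has identical frames, and the measurement-driven correction is frame-dependent, in line with the behaviour sketched in Figure 2(c).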