Frame Interpolation with Consecutive Brownian Bridge Diffusion

Zonglin Lyu, Ming Li, Jianbo Jiao, Chen Chen

TL;DR

This work proposes consecutive Brownian Bridge diffusion that takes a deterministic initial value as input, resulting in a much smaller cumulative variance of generated latent representations, leaving strong potential for further enhancement in VFI.

Abstract

Recent work in Video Frame Interpolation (VFI) formulates VFI as a diffusion-based conditional image generation problem, synthesizing the intermediate frame from random noise and the neighboring frames. Because videos have relatively high resolution, Latent Diffusion Models (LDMs) are employed as the conditional generation model: an autoencoder compresses images into latent representations for diffusion and then reconstructs images from these latent representations. This formulation poses a crucial challenge: VFI expects the output to be deterministically equal to the ground truth intermediate frame, but LDMs generate a diverse set of images when the model is run multiple times. The reason for this diversity is that the cumulative variance (the variance accumulated at each step of generation) of the generated latent representations in LDMs is large. This makes the sampling trajectory random, resulting in diverse rather than deterministic generations. To address this problem, we propose Frame Interpolation with Consecutive Brownian Bridge Diffusion. Specifically, we propose consecutive Brownian Bridge diffusion, which takes a deterministic initial value as input and therefore yields a much smaller cumulative variance of the generated latent representations. Our experiments suggest that our method improves as the autoencoder improves and achieves state-of-the-art performance in VFI, leaving strong potential for further enhancement.
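To make the variance argument concrete, the block below states the marginal of a textbook Brownian bridge. It is a sketch under assumptions rather than the paper's own equation: the endpoints $\mathbf{a}$, $\mathbf{b}$, the horizon $T$, and the unit diffusion coefficient are illustrative choices, and the paper's exact parameterization may differ.

```latex
% Standard Brownian bridge pinned at a (t = 0) and b (t = T).
% Assumption: unit diffusion coefficient; a, b, T are illustrative symbols.
W_t \mid (W_0 = \mathbf{a},\; W_T = \mathbf{b})
  \sim \mathcal{N}\!\left(
    \left(1 - \tfrac{t}{T}\right)\mathbf{a} + \tfrac{t}{T}\,\mathbf{b},\;
    \tfrac{t\,(T - t)}{T}\,\mathbf{I}
  \right),
  \qquad 0 \le t \le T,
\\
\max_{0 \le t \le T} \frac{t\,(T - t)}{T} = \frac{T}{4}
  \quad \text{at } t = \tfrac{T}{2}.
```

Because both endpoints are fixed, the variance vanishes at $t = 0$ and $t = T$ and never exceeds $T/4$ in between. A process that starts from random noise, by contrast, has a random starting point, which is the intuition behind the smaller cumulative variance claimed above.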

Paper Structure

This paper contains 23 sections, 33 equations, 9 figures, 4 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Illustration of our two-stage method. The encoder is shared for all frames. (a) The autoencoder stage. In this stage, the previous frame $I_0$, intermediate frame $I_n$, and next frame $I_1$ are encoded by the encoder to $\mathbf{y},\mathbf{x},\mathbf{z}$ respectively. Then $\mathbf{x}$ is fed to the decoder, together with the encoder features of $I_0,I_1$ at different down-sampling factors. The decoder predicts the intermediate frame as $\hat{I}_n$. The encoder and decoder are trained in this stage. (b) The ground truth estimation stage. In this stage, $\mathbf{y},\mathbf{x},\mathbf{z}$ are fed to the consecutive Brownian Bridge diffusion as three endpoints, where we sample two states that move time step $s$ from $\mathbf{x}$ in both directions. The UNet predicts the difference between the current state and $\mathbf{x}$. The autoencoder is well-trained and frozen in this stage. (c) Inference. $\hat{\mathbf{x}}$ is sampled from $\mathbf{y},\mathbf{z}$ to estimate $\mathbf{x}$ (details in the section on consecutive Brownian Bridge diffusion). The decoder receives $\hat{\mathbf{x}}$ and the encoder features of $I_0,I_1$ at different down-sampling factors to interpolate the intermediate frame.
  • Figure 2: Architecture of the autoencoder. The encoder is in green dashed boxes, and the decoder contains all remaining parts. The output of consecutive Brownian Bridge diffusion is fed to the VQ layer. The features of $I_0,I_1$ at different down-sampling rates are sent to the cross-attention module in the Up Sample Block of the decoder.
  • Figure 3: The reconstruction quality of our autoencoder and LDMVFI's autoencoder (decoding with the ground truth latent representation $\mathbf{x}$). Images are cropped with green boxes for detailed comparison. Red circles highlight the details where our method achieves better performance. LDMVFI usually outputs overlaid images, while our method does not.
  • Figure 4: Visual comparison of the interpolated results of LDMVFI (Danier et al., 2023) and our method using the same autoencoder as LDMVFI (LDMVFI vs. ours$^\dagger$ in the results table). With the same autoencoder, our method still achieves better visual quality than LDMVFI, demonstrating the superiority of our proposed consecutive Brownian Bridge diffusion.
  • Figure 5: Visual illustration of the inconsistency between PSNR/SSIM and visual quality. Only images cropped within blue boxes are evaluated with PSNR/SSIM. The red circles highlight our visual quality. Our method generates images with better visual quality, but the PSNR/SSIM is much lower.
  • ...and 4 more figures
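
The ground truth estimation stage in Figure 1(b) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: it uses a textbook Brownian bridge with unit diffusion, places $\mathbf{x}$ at time 0 of each bridge with $\mathbf{y}$ and $\mathbf{z}$ at time $T$, and the names `bridge_sample` and `consecutive_bridge_states` are hypothetical. The paper's actual time indexing and noise schedule may differ.

```python
import numpy as np


def bridge_sample(start, end, s, T, rng):
    """Sample a standard Brownian bridge at time s.

    Assumption: the bridge is pinned at `start` (time 0) and `end` (time T)
    with unit diffusion, so the variance at time s is s * (T - s) / T.
    """
    mean = (1.0 - s / T) * start + (s / T) * end
    std = np.sqrt(s * (T - s) / T)
    return mean + std * rng.standard_normal(start.shape)


def consecutive_bridge_states(y, x, z, s, T, rng):
    """Sample two states that move s steps away from x, one toward y, one toward z.

    Returns (state, target) pairs, where the target is the difference between
    the current state and x, the quantity the UNet is described as predicting.
    """
    toward_y = bridge_sample(x, y, s, T, rng)
    toward_z = bridge_sample(x, z, s, T, rng)
    return (toward_y, toward_y - x), (toward_z, toward_z - x)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy latents standing in for the encoded frames I_0, I_n, I_1.
    y, x, z = np.zeros(4), np.ones(4), 2.0 * np.ones(4)
    (state_y, target_y), (state_z, target_z) = consecutive_bridge_states(
        y, x, z, s=3.0, T=10.0, rng=rng
    )
    print(state_y.shape, target_z.shape)
```

At `s = 0` both states coincide with `x` and both targets are zero, while at `s = T` they coincide with `y` and `z`. This reflects the deterministic endpoints that the caption describes.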