DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion

Liao Shen, Tianqi Liu, Huiqiang Sun, Xinyi Ye, Baopu Li, Jianming Zhang, Zhiguo Cao

TL;DR

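This work proposes DreamMover, a novel image interpolation framework with three main components: a diffusion-based natural flow estimator that implicitly reasons about the semantic correspondence between two images, a two-level fusion scheme that fuses information in high-level and low-level spaces to avoid losing detail, and a self-attention concatenation and replacement approach that keeps the generated images consistent with the inputs.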

Abstract

We study the problem of generating intermediate images from image pairs with large motion while maintaining semantic consistency. Because the motion is large, the intermediate semantic information may be absent from the input images. Existing methods are either limited to small motion or focus on topologically similar objects, leading to artifacts and inconsistency in the interpolation results. To overcome this challenge, we delve into pre-trained image diffusion models for their capabilities in semantic cognition and representation, ensuring that the absent intermediate semantic representations are expressed consistently with the input. To this end, we propose DreamMover, a novel image interpolation framework with three main components: 1) A natural flow estimator based on the diffusion model that can implicitly reason about the semantic correspondence between two images. 2) A two-part fusion strategy that fuses information in high-level space and low-level space to avoid the loss of detailed information during fusion. 3) A self-attention concatenation and replacement approach that enhances the consistency between the generated images and the input. Lastly, we present a challenging benchmark dataset, InterpBench, to evaluate the semantic consistency of generated results. Extensive experiments demonstrate the effectiveness of our method. Our project is available at https://dreamm0ver.github.io.
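
As a rough illustration of how semantic correspondence between two feature maps can yield an optical flow field, the sketch below matches each location in one feature map to its most similar location in the other by cosine similarity. This is a minimal sketch under stated assumptions, not the paper's estimator: the function name `flow_from_features`, the `(H, W, C)` feature layout, and the plain nearest-neighbour matching are all hypothetical choices, and in practice the features would come from a pre-trained diffusion model rather than random arrays.

```python
import numpy as np


def flow_from_features(feat0, feat1):
    """Nearest-neighbour flow from feat0 to feat1 (hypothetical helper).

    feat0, feat1: arrays of shape (H, W, C) holding per-location features.
    Returns an integer array of shape (H, W, 2) with (dx, dy) per location.
    """
    h, w, c = feat0.shape
    f0 = feat0.reshape(-1, c)
    f1 = feat1.reshape(-1, c)
    # L2-normalise so the dot product below is a cosine similarity.
    f0 = f0 / (np.linalg.norm(f0, axis=1, keepdims=True) + 1e-8)
    f1 = f1 / (np.linalg.norm(f1, axis=1, keepdims=True) + 1e-8)
    sim = f0 @ f1.T                      # (H*W, H*W) similarity matrix
    best = sim.argmax(axis=1)            # best match in feat1 per source location
    ys, xs = np.divmod(np.arange(h * w), w)
    by, bx = np.divmod(best, w)
    # Flow is the displacement from the source location to its match.
    return np.stack([bx - xs, by - ys], axis=1).reshape(h, w, 2)


# Toy usage: feat1 is feat0 shifted one column to the right (with wrap-around).
rng = np.random.default_rng(0)
feat0 = rng.standard_normal((4, 5, 8))
feat1 = np.roll(feat0, shift=1, axis=1)
forward_flow = flow_from_features(feat0, feat1)
backward_flow = flow_from_features(feat1, feat0)
```

Calling the function in both directions gives a bidirectional pair of flows. The dense similarity matrix grows quadratically with the number of locations, so a sketch like this only suits low-resolution feature maps.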

Paper Structure

This paper contains 16 sections, 11 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Given two input images with large motion, our proposed method can generate a short video with high fidelity and semantic consistency compared to previous approaches. To see the dynamic effect of our method, we encourage readers to watch our supplementary video.
  • Figure 2: Overview of our method. Given two input images $\mathcal{I}^0$ and $\mathcal{I}^1$, we extract feature maps and leverage them to obtain the bidirectional optical flow $F^{0\to1}$ and $F^{1\to0}$. Next, we decompose the noisy latent code $z_T$ into a two-level space and perform softmax splatting and time interpolation for image fusion. For the high-frequency information $\epsilon_\theta$, we replace all weighted average operations with "Winner-Takes-All" (WTA). In addition, we propose a novel self-attention replacement method for consistency. Finally, our method generates a sequence of high-fidelity interpolation frames.
  • Figure 3: The potential of the diffusion model for optical flow estimation. We perform PCA on the features and observe spatial layouts consistent with the input images, and we obtain bidirectional optical flow through the correspondence between feature maps.
  • Figure 4: The process of direct fusion and our proposed two-level fusion. Generally, $z_{T\to0}$ represents a latent code. Here, for clearer visualization, we illustrate the RGB image decoded from it to emphasize a significant loss of high-frequency information compared to input images.
  • Figure 5: (a) Effects of fusion in different spaces. Compared to direct fusion, our strategy better preserves details in the RGB image and maintains more high-frequency energy in the Fourier spectrograms. (b) Definition of the high-frequency region. We define it as the part of the spectrogram beyond the centre 1/4. (c) High-frequency variations during denoising.
  • ...and 4 more figures
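
The high-frequency region described in the Figure 5 caption can be turned into a simple energy measure. The sketch below computes the share of spectral energy that falls outside a centred low-frequency box of a shifted 2D Fourier spectrum. It is an illustration under assumptions rather than the paper's measurement code: the name `high_freq_ratio` is hypothetical, and "the centre 1/4" is read here as a quarter of the spectrogram's area (a box with half the height and half the width), which the caption does not state explicitly.

```python
import numpy as np


def high_freq_ratio(img, centre_fraction=0.5):
    """Share of spectral energy outside the central low-frequency box.

    img: 2D array (a single-channel image).
    centre_fraction: side length of the central box relative to the
        spectrum; 0.5 per side covers 1/4 of the area (an assumption).
    """
    # Shift the spectrum so the zero frequency sits at the centre.
    spec = np.fft.fftshift(np.fft.fft2(img))
    energy = np.abs(spec) ** 2
    h, w = energy.shape
    bh, bw = int(h * centre_fraction), int(w * centre_fraction)
    top, left = (h - bh) // 2, (w - bw) // 2
    low = energy[top:top + bh, left:left + bw].sum()
    total = energy.sum()
    return float((total - low) / total) if total > 0 else 0.0


# Toy usage: a flat image has only a DC component, a checkerboard
# concentrates its energy at the highest representable frequency.
flat = np.ones((8, 8))
yy, xx = np.indices((8, 8))
checker = np.where((yy + xx) % 2 == 0, 1.0, -1.0)
flat_ratio = high_freq_ratio(flat)
checker_ratio = high_freq_ratio(checker)
```

Comparing this ratio for an input image and for a fused result gives one number for how much detail the fusion keeps, in the spirit of the comparison in Figure 5(a).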