Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation

Tserendorj Adiya, Jae Shin Yoon, Jungeun Lee, Sanghun Kim, Hwasup Lim

TL;DR

This work claims that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance, and designs a novel human animation framework using a denoising diffusion model.

Abstract

We introduce a method to generate temporally coherent human animation from a single image, a video, or random noise. This problem has been formulated as auto-regressive generation, i.e., regressing past frames to decode future frames. However, such unidirectional generation is highly prone to motion drifting over time, producing unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance. To support this claim, we design a novel human animation framework using a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noise, with intermediate results cross-conditioned bidirectionally between consecutive frames. In our experiments, the method demonstrates strong performance and realistic temporal coherence compared to existing unidirectional approaches.
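
To make the conditioning pattern in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each "frame" is a single scalar and replaces the learned denoising network with a hypothetical stand-in, `toy_denoise`, that simply pulls a frame toward the mean of its conditioning frames. The names `toy_denoise`, `run_denoising`, and `roughness`, as well as the `strength` parameter, are illustrative assumptions. The only point shown is that, at each denoising step, a frame is conditioned on the previous step's intermediate results of both its past and future neighbors (bidirectional) or of its past neighbor only (unidirectional).

```python
import random


def toy_denoise(noisy, conditioning, strength=0.5):
    """Hypothetical stand-in for a learned denoiser (assumption, not the paper's network).

    Pulls the noisy value toward the mean of its conditioning frames.
    """
    if not conditioning:
        return noisy
    mean = sum(conditioning) / len(conditioning)
    return noisy + strength * (mean - noisy)


def run_denoising(num_frames=8, num_steps=10, bidirectional=True, seed=0):
    """Run a toy denoising loop over a sequence of scalar 'frames'."""
    rng = random.Random(seed)
    # Start every frame from independent Gaussian noise.
    frames = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]
    for _ in range(num_steps):
        # Intermediate results of the previous denoising step.
        prev = list(frames)
        for t in range(num_frames):
            conditioning = []
            if t > 0:
                conditioning.append(prev[t - 1])  # past neighbor
            if bidirectional and t + 1 < num_frames:
                conditioning.append(prev[t + 1])  # future neighbor
            frames[t] = toy_denoise(prev[t], conditioning)
    return frames


def roughness(frames):
    """Sum of absolute differences between consecutive frames."""
    return sum(abs(a - b) for a, b in zip(frames, frames[1:]))


if __name__ == "__main__":
    initial = run_denoising(num_steps=0)
    bidir = run_denoising(num_steps=10, bidirectional=True)
    print("initial roughness:      ", roughness(initial))
    print("bidirectional roughness:", roughness(bidir))
```

Because the stand-in denoiser is just neighbor averaging, the numbers it prints say nothing about the paper's results. The sketch only illustrates where the cross-frame conditioning enters the denoising loop.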

Paper Structure (27 sections, 4 equations, 18 figures, 3 tables, 1 algorithm)

This paper contains 27 sections, 4 equations, 18 figures, 3 tables, 1 algorithm.

Figures (18)

  • Figure 1: Our method generates temporally coherent human animation from various modalities.
  • Figure 2: Results from a unidirectional generative model with texture drifting over time.
  • Figure 3: The left illustration represents a unidirectional diffusion model, and the right one provides an overview of our proposed bidirectional temporal diffusion model (BTDM). The dotted arrows indicate the direction of conditioning, and $k$ and $t$ represent the denoising step and time interval, respectively.
  • Figure 4: Illustration of (a) our BTU-Net and (b) the bidirectional attention block. The dotted squares in (a) represent bidirectional attention blocks. The small blue and pink squares in (a) indicate the intermediate features of $E_p$ and $E_a$, respectively.
  • Figure 5: Qualitative comparisons for the single image animation task on graphics simulation (left) and UBC Fashion data (right).
  • ...and 13 more figures