LatentMan: Generating Consistent Animated Characters using Image Diffusion Models

Abdelrahman Eldesokey, Peter Wonka

TL;DR

LatentMan tackles the challenge of generating temporally coherent videos of animated characters in a zero-shot setting by combining text-driven motion diffusion with a pre-trained Text-to-Image model. It introduces Spatial Latent Alignment to propagate latent codes along cross-frame correspondences derived from DensePose and implements Pixel-Wise Guidance to reduce high-frequency frame-to-frame discrepancies. The approach leverages a text-based Motion Diffusion Model to provide continuous motion cues, rendered as depth maps for ControlNet conditioning, enabling bidirectional consistency between motion and appearance without reference videos. Quantitatively, LatentMan reduces temporal inconsistency by approximately 9–10% on a pixel-difference metric and is preferred by a majority of users in perceptual studies, demonstrating strong improvements over zero-shot baselines.
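The idea of propagating latent codes along cross-frame correspondences can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `align_latents`, the `(C, H, W)` latent layout, and the hand-written correspondence arrays are all assumptions made for illustration. In the paper the correspondences are derived from DensePose rather than specified by hand.

```python
import numpy as np

def align_latents(prev_latent, cur_latent, src_yx, dst_yx):
    """Copy latent vectors from the previous frame into the current frame.

    prev_latent, cur_latent: arrays of shape (C, H, W) (assumed layout).
    src_yx, dst_yx: integer arrays of shape (N, 2) holding (row, col) pairs;
    row i states that pixel src_yx[i] in the previous frame corresponds to
    pixel dst_yx[i] in the current frame.
    """
    aligned = cur_latent.copy()
    # Overwrite the current latent only where a correspondence exists.
    aligned[:, dst_yx[:, 0], dst_yx[:, 1]] = prev_latent[:, src_yx[:, 0], src_yx[:, 1]]
    return aligned

rng = np.random.default_rng(0)
prev_latent = rng.standard_normal((4, 8, 8))
cur_latent = rng.standard_normal((4, 8, 8))

# Hypothetical correspondences: the character moved one row down.
src_yx = np.array([[2, 3], [2, 4], [3, 3]])
dst_yx = src_yx + np.array([1, 0])

aligned = align_latents(prev_latent, cur_latent, src_yx, dst_yx)
```

Positions without a correspondence keep the current frame's own latent, so only the regions covered by the character are shared across frames in this sketch.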

Abstract

We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos with continuous motion. We strive to bridge this gap, and we introduce LatentMan, which leverages existing text-based motion diffusion models to generate diverse continuous motions to guide the T2I model. To boost the temporal consistency, we introduce the Spatial Latent Alignment module that exploits cross-frame dense correspondences that we compute to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies between frames. Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference. Project page https://abdo-eldesokey.github.io/latentman/.
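As a rough illustration of steering generation toward smaller cross-frame discrepancies, the sketch below takes one gradient step on a squared-difference loss between corresponding pixels. It is a simplification under stated assumptions, not the paper's Pixel-Wise Guidance: it operates directly on image arrays of shape `(H, W, 3)` instead of inside a diffusion sampler, and the names `discrepancy_loss`, `guidance_step`, and `step_size` are hypothetical.

```python
import numpy as np

def discrepancy_loss(prev_img, cur_img, src_yx, dst_yx):
    """Half the sum of squared differences between corresponding pixels."""
    diff = cur_img[dst_yx[:, 0], dst_yx[:, 1]] - prev_img[src_yx[:, 0], src_yx[:, 1]]
    return 0.5 * float((diff ** 2).sum())

def guidance_step(prev_img, cur_img, src_yx, dst_yx, step_size=0.5):
    """Move the current frame one gradient step toward the previous frame."""
    diff = cur_img[dst_yx[:, 0], dst_yx[:, 1]] - prev_img[src_yx[:, 0], src_yx[:, 1]]
    grad = np.zeros_like(cur_img)
    # The gradient of the loss w.r.t. each destination pixel is its difference;
    # np.add.at accumulates correctly even if a destination pixel repeats.
    np.add.at(grad, (dst_yx[:, 0], dst_yx[:, 1]), diff)
    return cur_img - step_size * grad

rng = np.random.default_rng(0)
prev_img = rng.random((8, 8, 3))
cur_img = rng.random((8, 8, 3))

# Hypothetical correspondences with unique destination pixels.
src_yx = np.array([[2, 3], [2, 4], [3, 3]])
dst_yx = src_yx + np.array([1, 0])

loss_before = discrepancy_loss(prev_img, cur_img, src_yx, dst_yx)
guided = guidance_step(prev_img, cur_img, src_yx, dst_yx)
loss_after = discrepancy_loss(prev_img, guided, src_yx, dst_yx)
```

With unique destination pixels, a step size of 0.5 halves every per-pixel difference, so the loss drops to a quarter of its previous value while pixels without correspondences are left unchanged.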

Paper Structure (19 sections, 9 equations, 11 figures, 2 tables, 1 algorithm)

Figures (11)

  • Figure 1: Prompts used to generate videos for the user study.
  • Figure 2: Cross-Frame Attention (CFAttn) is adopted by multiple zero-shot T2V approaches to generate globally consistent video frames. However, when the conditioning signal (the depth map) changes, e.g., is shifted up, the fine details (shown in the insets) tend to vary between frames. We find that this is caused by the distributional shift of the initial latent codes that are aligned with the character, as shown on the plot to the right. Our proposed approach attempts to align the latent codes in a zero-shot manner, eliminating the distribution shift and producing consistent images. *CN refers to ControlNet.
  • Figure 2: DensePose embeddings for different frames have dissimilar distributions for the UV-coordinates. By computing cross-frame correspondences, we align these coordinates, and we obtain a pixel-wise mapping between the two frames.
  • Figure 3: An overview of our proposed approach. Given a text prompt $\mathcal{T}$, a motion diffusion model produces a sequence of human skeletons that we use to obtain frame-wise depth maps and DensePose embeddings. The former are used as guidance for ControlNet, while the latter are used to compute cross-frame correspondences. These correspondences are employed by the Spatial Latent Alignment and the Pixel-Wise Guidance modules to boost temporal consistency. The orange block shows an illustration of how we compute cross-frame correspondences between two frames for the "torso" body part based on DensePose. The blue block shows how we employ these correspondences to spatially align the latents to promote consistent synthesis.
  • Figure 3: An ablation study for different components of our pipeline: Spatial Latent Alignment (SLA), and Pixel-Wise Guidance (PWG).
  • ...and 6 more figures
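
The captions above describe computing cross-frame correspondences from DensePose UV coordinates for a given body part. One plausible way to obtain such a pixel-wise mapping is a nearest-neighbour search in UV space, sketched below with NumPy. The matching rule, the function name `uv_correspondences`, the array layouts, and the synthetic inputs are assumptions for illustration; the paper's exact procedure may differ.

```python
import numpy as np

def uv_correspondences(part_a, uv_a, part_b, uv_b, part_id):
    """Match each pixel of `part_id` in frame B to a pixel of that part in frame A.

    part_a, part_b: (H, W) integer body-part label maps (assumed layout).
    uv_a, uv_b: (H, W, 2) per-pixel UV coordinates (assumed layout).
    Returns (src_yx, dst_yx), each of shape (N, 2) with (row, col) pairs,
    where src_yx[i] in frame A has the UV value nearest to dst_yx[i] in frame B.
    """
    ya, xa = np.nonzero(part_a == part_id)
    yb, xb = np.nonzero(part_b == part_id)
    if ya.size == 0 or yb.size == 0:
        empty = np.empty((0, 2), dtype=int)
        return empty, empty.copy()
    feats_a = uv_a[ya, xa]  # (Na, 2)
    feats_b = uv_b[yb, xb]  # (Nb, 2)
    # Brute-force squared UV distances between every B pixel and every A pixel.
    d2 = ((feats_b[:, None, :] - feats_a[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    src_yx = np.stack([ya[nearest], xa[nearest]], axis=1)
    dst_yx = np.stack([yb, xb], axis=1)
    return src_yx, dst_yx

# Synthetic example: a 2x3 "torso" patch (part id 1) with unique UV values.
H, W = 6, 6
part_a = np.zeros((H, W), dtype=int)
part_a[1:3, 1:4] = 1
uv_a = np.zeros((H, W, 2))
uv_a[1:3, 1:4, 0] = np.linspace(0.2, 0.8, 3)[None, :]  # U varies across columns
uv_a[1:3, 1:4, 1] = np.linspace(0.3, 0.7, 2)[:, None]  # V varies across rows

# Frame B is frame A shifted down by two rows.
part_b = np.roll(part_a, 2, axis=0)
uv_b = np.roll(uv_a, 2, axis=0)

src_yx, dst_yx = uv_correspondences(part_a, uv_a, part_b, uv_b, part_id=1)
```

In this synthetic case every matched pair differs by exactly the two-row shift. The brute-force distance matrix is only practical for small parts; a spatial index such as a k-d tree would be a natural substitute at full resolution.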