EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation

Qiang Qu, Ming Li, Xiaoming Chen, Tongliang Liu

TL;DR

EvAnimate introduces a diffusion-based framework that uses event-camera data as motion cues to animate static human images, addressing limitations of traditional frame-based cues such as low temporal resolution and motion blur. It converts asynchronous events into a three-channel TCB-slice representation and employs a dual-branch architecture with an EvPose latent space and a Motion Gradient Alignment loss to achieve high temporal fidelity and robust performance under challenging lighting and motion. The approach is validated on both simulated (EvTikTok) and real-world (EvHumanMotion) datasets, outperforming state-of-the-art methods across scenarios and temporal resolutions, and demonstrating strong cross-person generalization through targeted augmentations. The work provides new benchmarks and datasets, enabling future exploration of event-conditioned human animation and real-world applicability in uncontrolled environments like concerts and outdoor performances.

Abstract

Conditional human animation traditionally animates static reference images using pose-based motion cues extracted from video data. However, these video-derived cues often suffer from low temporal resolution, motion blur, and unreliable performance under challenging lighting conditions. In contrast, event cameras inherently provide robust and high temporal-resolution motion information, offering resilience to motion blur, low-light environments, and exposure variations. In this paper, we propose EvAnimate, the first method leveraging event streams as robust and precise motion cues for conditional human image animation. Our approach is fully compatible with diffusion-based generative models, enabled by encoding asynchronous event data into a specialized three-channel representation with adaptive slicing rates and densities. High-quality and temporally coherent animations are achieved through a dual-branch architecture explicitly designed to exploit event-driven dynamics, significantly enhancing performance under challenging real-world conditions. Enhanced cross-subject generalization is further achieved using specialized augmentation strategies. To facilitate future research, we establish a new benchmark, including simulated event data for training and validation, and a real-world event dataset capturing human actions under normal and challenging scenarios. The experimental results demonstrate that EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
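To make the idea of encoding asynchronous events into a three-channel, image-like array concrete, the following is a minimal NumPy sketch. The abstract does not specify what the three channels hold, so the layout used here (positive-event counts, negative-event counts, normalized latest timestamp) is an illustrative assumption and not the paper's representation. The function name `events_to_three_channel` and the `(t, x, y, p)` event layout are likewise hypothetical.

```python
import numpy as np

def events_to_three_channel(t, x, y, p, t_start, t_end, height, width):
    """Accumulate events with t_start <= t < t_end into an (H, W, 3) array.

    Hypothetical channel layout (an assumption, not taken from the paper):
      0: number of positive-polarity events per pixel
      1: number of negative-polarity events per pixel
      2: timestamp of the latest event per pixel, normalized to [0, 1)
    """
    t, x, y, p = (np.asarray(a) for a in (t, x, y, p))
    keep = (t >= t_start) & (t < t_end)
    t, x, y, p = t[keep], x[keep], y[keep], p[keep]

    pos_count = np.zeros((height, width), dtype=np.float32)
    neg_count = np.zeros((height, width), dtype=np.float32)
    latest = np.zeros((height, width), dtype=np.float32)

    is_pos = p > 0
    # np.add.at accumulates correctly when one pixel fires several times.
    np.add.at(pos_count, (y[is_pos], x[is_pos]), 1.0)
    np.add.at(neg_count, (y[~is_pos], x[~is_pos]), 1.0)

    # Relative time inside the slice; keep the largest value per pixel.
    rel_t = ((t - t_start) / (t_end - t_start)).astype(np.float32)
    np.maximum.at(latest, (y, x), rel_t)

    return np.stack([pos_count, neg_count, latest], axis=-1)

# Four toy events on a 2x3 sensor; the last one falls outside the slice.
t = [0.0, 0.1, 0.2, 0.9]
x = [1, 1, 2, 0]
y = [0, 0, 1, 1]
p = [1, 1, -1, 1]
frame = events_to_three_channel(t, x, y, p, 0.0, 0.5, height=2, width=3)
print(frame.shape)  # (2, 3, 3)
```

A three-channel array of this shape can be fed to image encoders that expect RGB-like input, which is one plausible reason a three-channel layout fits diffusion-based pipelines.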

Paper Structure

This paper contains 27 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison between conventional image-to-video methods and the proposed EvAnimate framework. EvAnimate leverages event streams as motion cues to generate controllable videos at high temporal resolutions. Moreover, EvAnimate produces superior video quality and exhibits enhanced robustness under challenging scenarios such as motion blur, low-light, and overexposure.
  • Figure 2: Overview of the proposed event-conditioned human animation framework.
  • Figure 3: Comparison of Event Representations. The proposed TCB-slices avoid the over-dense output seen with fixed-duration windows at low frame rates and the sparse output at high frame rates, while also enabling precise control over the frame rate compared to fixed-size windows.
  • Figure 4: Structure of the video generation module. At its core, a spatial-temporal UNet generates latent representations of video frames. Four key components guide the process: (1) Reference Image Alignment preserves the visual characteristics of the input by projecting the reference image into the latent space via a VAE and integrating semantic features from CLIP and face encoders; (2) Event Condition Alignment controls motion by estimating pose from event signals and jointly encoding pose and event representations using a dual-encoder (EvPose Encoder); (3) Diffusion Loss serves as the primary training objective by matching the latent representations of generated and ground truth videos; and (4) Motion Gradient Alignment Loss leverages event conditions to enforce consistent, realistic motion dynamics.
  • Figure 5: Qualitative comparison of EvAnimate with state-of-the-art methods across various scenarios (low light, overexposure, motion blur, normal). The first column shows the reference image, followed by animations generated by AnimateAnyone, MagicPose, MagicAnimate, StableAnimator, and our method. EvAnimate consistently achieves superior visual fidelity and accurate motion reproduction, especially under challenging conditions.
  • ...and 1 more figure
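The trade-off described in the Figure 3 caption can be illustrated with a short NumPy sketch. It compares the two baseline windowing schemes on a synthetic event stream: fixed-duration windows, whose event density depends on the frame rate, and fixed-size windows, whose duration (and therefore frame rate) depends on the event rate. The third function, `timestamp_anchored_slices`, is a hypothetical hybrid written for illustration only; it is not the paper's TCB-slice definition, and all function names and parameters here are assumptions.

```python
import numpy as np

def fixed_duration_counts(t, fps, t_end):
    """Events per slice when every slice lasts exactly 1/fps seconds."""
    n_slices = int(round(t_end * fps))
    edges = np.linspace(0.0, t_end, n_slices + 1)
    counts, _ = np.histogram(t, bins=edges)
    return counts

def fixed_count_durations(t, n_events):
    """Time between the starts of consecutive slices of n_events events."""
    t = np.sort(np.asarray(t))
    return np.diff(t[::n_events])

def timestamp_anchored_slices(t, fps, t_end, max_events):
    """Hypothetical hybrid (an assumption, not the paper's method).

    One slice per target frame timestamp, so the frame rate is exact;
    each slice keeps at most the max_events most recent events, so the
    density is bounded. Returns (start, end) index pairs into sorted t.
    Consecutive slices may overlap when events are sparse.
    """
    t = np.sort(np.asarray(t))
    n_frames = int(round(t_end * fps))
    frame_times = np.arange(1, n_frames + 1) / fps
    ends = np.searchsorted(t, frame_times, side="right")
    starts = np.maximum(ends - max_events, 0)
    return list(zip(starts.tolist(), ends.tolist()))

# Synthetic stream: 1000 evenly spaced events over one second.
t = (np.arange(1000) + 0.5) / 1000.0

low_fps = fixed_duration_counts(t, fps=10, t_end=1.0)    # dense slices
high_fps = fixed_duration_counts(t, fps=100, t_end=1.0)  # sparse slices
durations = fixed_count_durations(t, n_events=100)       # rate set by data
slices = timestamp_anchored_slices(t, fps=100, t_end=1.0, max_events=50)

print(low_fps[0], high_fps[0])  # 100 10
print(len(slices))              # 100
```

On this stream the fixed-duration scheme yields 100 events per slice at 10 fps but only 10 at 100 fps, while the fixed-size scheme produces one slice every 0.1 s regardless of the frame rate requested. With a bursty stream those durations would vary, which is the loss of frame-rate control the caption refers to.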