EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation
Qiang Qu, Ming Li, Xiaoming Chen, Tongliang Liu
TL;DR
EvAnimate introduces a diffusion-based framework that uses event-camera data as motion cues to animate static human images, addressing limitations of traditional frame-based cues such as low temporal resolution and motion blur. It converts asynchronous events into a three-channel TC B-slice representation and employs a dual-branch architecture with an EvPose latent space and a Motion Gradient Alignment loss to achieve high temporal fidelity and robust performance under challenging lighting and motion. The approach is validated on both simulated (EvTikTok) and real-world (EvHumanMotion) datasets, outperforming state-of-the-art methods across scenarios and temporal resolutions, and demonstrating strong cross-person generalization through targeted augmentations. The work provides new benchmarks and datasets, enabling future exploration of event-conditioned human animation and real-world applicability in uncontrolled environments like concerts and outdoor performances.
Abstract
Conditional human animation traditionally animates static reference images using pose-based motion cues extracted from video data. However, these video-derived cues often suffer from low temporal resolution, motion blur, and unreliable performance under challenging lighting conditions. In contrast, event cameras inherently provide robust, high-temporal-resolution motion information, offering resilience to motion blur, low-light environments, and exposure variations. In this paper, we propose EvAnimate, the first method leveraging event streams as robust and precise motion cues for conditional human image animation. Our approach is fully compatible with diffusion-based generative models, enabled by encoding asynchronous event data into a specialized three-channel representation with adaptive slicing rates and densities. High-quality and temporally coherent animations are achieved through a dual-branch architecture explicitly designed to exploit event-driven dynamics, significantly enhancing performance under challenging real-world conditions. Enhanced cross-subject generalization is further achieved using specialized augmentation strategies. To facilitate future research, we establish a new benchmark, comprising simulated event data for training and validation and a real-world event dataset capturing human actions under normal and challenging scenarios. Experimental results demonstrate that EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
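To make the idea of turning an asynchronous event stream into image-like frames more concrete, the following is a minimal sketch of one common way to do it. It is not the paper's encoding: the channel layout (positive counts, negative counts, latest normalized timestamp), the function names `events_to_three_channel` and `slice_stream`, and the uniform slicing controlled by `num_slices` are all assumptions chosen for illustration. The abstract only states that events are encoded into a three-channel representation with adaptive slicing rates and densities.

```python
import numpy as np


def events_to_three_channel(events, height, width, t_start, t_end):
    """Accumulate events with t in [t_start, t_end) into an (H, W, 3) frame.

    events: array-like of shape (N, 4), columns (x, y, t, polarity),
            with polarity > 0 for brightness increase, otherwise decrease.
    Assumed (illustrative) channel layout:
      0: count of positive events per pixel
      1: count of negative events per pixel
      2: latest event timestamp per pixel, normalized to [0, 1) in the window
    """
    frame = np.zeros((height, width, 3), dtype=np.float32)
    ev = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    ev = ev[(ev[:, 2] >= t_start) & (ev[:, 2] < t_end)]
    if ev.shape[0] == 0:
        return frame

    x = ev[:, 0].astype(np.int64)
    y = ev[:, 1].astype(np.int64)
    pos = ev[:, 3] > 0

    # Unbuffered accumulation so repeated pixel indices are all counted.
    np.add.at(frame[:, :, 0], (y[pos], x[pos]), 1.0)
    np.add.at(frame[:, :, 1], (y[~pos], x[~pos]), 1.0)

    t_norm = ((ev[:, 2] - t_start) / (t_end - t_start)).astype(np.float32)
    np.maximum.at(frame[:, :, 2], (y, x), t_norm)
    return frame


def slice_stream(events, height, width, t0, t1, num_slices):
    """Split [t0, t1) into num_slices equal windows and encode each one.

    A larger num_slices yields a higher effective frame rate from the
    same event stream. Returns an array of shape (num_slices, H, W, 3).
    """
    edges = np.linspace(t0, t1, num_slices + 1)
    frames = [
        events_to_three_channel(events, height, width, edges[i], edges[i + 1])
        for i in range(num_slices)
    ]
    return np.stack(frames, axis=0)
```

Because the slicing window is a free parameter, the same recording can be re-sliced at different rates, which is the property that lets event data supply motion cues at temporal resolutions a fixed-frame-rate video cannot. The three-channel output has the same layout as an RGB image, which is what would let it be fed to image-based encoders in a diffusion pipeline.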
