WildActor: Unconstrained Identity-Preserving Video Generation

Qin Guo, Tianyu Yang, Xuanhua He, Fei Shen, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Dan Xu

TL;DR

This work presents Actor-18M, a large-scale human video dataset designed to capture identity consistency under unconstrained viewpoints and environments, and proposes WildActor, a framework for any-view conditioned human video generation.

Abstract

Production-ready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level consistency, or produce copy-paste artifacts where subjects appear rigid due to pose locking. We present Actor-18M, a large-scale human video dataset designed to capture identity consistency under unconstrained viewpoints and environments. Actor-18M comprises 1.6M videos with 18M corresponding human images, covering both arbitrary views and canonical three-view representations. Leveraging Actor-18M, we propose WildActor, a framework for any-view conditioned human video generation. We introduce an Asymmetric Identity-Preserving Attention mechanism coupled with a Viewpoint-Adaptive Monte Carlo Sampling strategy that iteratively re-weights reference conditions by marginal utility for balanced manifold coverage. Evaluated on the proposed Actor-Bench, WildActor consistently preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motions, surpassing existing methods in these challenging settings.
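The abstract describes the sampling strategy only at a high level, so the following is a minimal illustrative sketch and not the authors' algorithm. It assumes each candidate reference image carries a yaw angle in degrees, and it assumes that "marginal utility" can be approximated by the angular gap to the nearest already-selected view. The function names `angular_gap` and `sample_references` are hypothetical.

```python
import random


def angular_gap(a, b):
    """Smallest absolute difference between two yaw angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def sample_references(yaws, k, rng=None):
    """Pick up to k reference indices, favoring views far from those chosen.

    Assumption (not from the paper): a candidate's weight is its angular gap
    to the nearest selected view, recomputed after every draw, so redundant
    viewpoints become progressively less likely to be sampled.
    """
    if not yaws or k <= 0:
        return []
    rng = rng or random.Random(0)
    remaining = list(range(len(yaws)))
    # Start from a uniformly random view.
    chosen = [remaining.pop(rng.randrange(len(remaining)))]
    while remaining and len(chosen) < k:
        # Re-weight every remaining candidate against the current selection.
        # The small epsilon keeps all weights strictly positive.
        weights = [
            min(angular_gap(yaws[i], yaws[j]) for j in chosen) + 1e-6
            for i in remaining
        ]
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
    return chosen
```

For example, `sample_references([0, 5, 10, 90, 180, 270], k=3)` returns three distinct indices, with widely separated yaws more likely to be drawn than near-duplicates of a view that is already selected.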

Paper Structure

This paper contains 24 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Construction pipeline and representative samples of Actor-18M. Left: Frames are sampled and filtered from identity-consistent videos, from which facial and body images are extracted as ground-truth references. Right: Based on these references, view-transformed samples are generated to construct Actor-18M-A, while attribute-conditioned image editing under diverse environments, lighting conditions, and motions produces Actor-18M-B. Canonical three-view images in Actor-18M-C are generated using Nano-Banana, guided by frames with the highest visibility selected from different viewpoints, serving as complete identity anchors.
  • Figure 2: Overview of WildActor. Left: The overall architecture, where a video DiT is conditioned on multi-view facial and body reference images selected via Viewpoint-Adaptive Monte Carlo Sampling. Reference tokens are embedded with I-RoPE to distinguish them from video tokens in the shared spatio-temporal attention space. Right: Details of AIPA, illustrating how identity reference tokens are aggregated and injected into video tokens through asymmetric attention while preserving the backbone's prior.
  • Figure 3: Qualitative comparison on sequential narrative. WildActor maintains stronger full-body consistency and prompt adherence than prior methods under viewpoint changes, camera motion, and scene transitions. Zoom in to better compare fine-grained details.
  • Figure 5: Ablation study on data strategies and model components. We evaluate variants in controlled scenes with turning motions, enabling clear comparison under viewpoint changes.