Animate Any Character in Any World
Yitong Wang, Fangyun Wei, Hongyang Zhang, Bo Dai, Yan Lu
TL;DR
AniX addresses the gap between static world models and controllable agents: given a user-provided 3D Gaussian Splatting (3DGS) scene and a character, users direct the character through open-ended natural-language instructions. It formulates video generation as a conditional autoregressive process conditioned on the scene, character views, and text, and fine-tunes a pre-trained video generator with LoRA on a GTA-V-based dataset, adding preceding-token conditioning for long-horizon coherence. The approach achieves high visual fidelity, strong character consistency, broad action controllability (including novel actions), and robust long-horizon generation, outperforming both foundation models and dedicated world models on WorldScore metrics. Distillation makes inference efficient, the method supports scene and character customization, and real-world data further enhances realism.
Abstract
Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, which leverages the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors, from basic locomotion to object-centric interactions, while freely exploring the environment. We formulate this as a conditional autoregressive video generation problem: AniX synthesizes temporally coherent video clips that remain visually faithful to the provided scene and character. Built upon a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.
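
To make the conditional autoregressive formulation concrete, the following is a minimal sketch of the control flow only, not the authors' implementation. Every name here (`Conditions`, `rollout`, `toy_generator`) is hypothetical, the scene and character inputs are placeholder strings, and the generator is a stub that stands in for the fine-tuned video model. The one idea it illustrates is that each clip is generated from the scene, the character views, the current text instruction, and the previously generated clip.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# A frame is a placeholder (index, instruction) pair; a clip is a list of frames.
Frame = Tuple[int, str]
Clip = List[Frame]


@dataclass
class Conditions:
    """Hypothetical container for the fixed, user-provided conditions."""
    scene: str                       # stands in for a 3DGS scene
    character_views: Sequence[str]   # stands in for character view images


# Hypothetical generator signature: (conditions, text, previous clip) -> new clip.
Generator = Callable[[Conditions, str, Optional[Clip]], Clip]


def rollout(generate_clip: Generator, cond: Conditions,
            instructions: Sequence[str]) -> List[Clip]:
    """Generate one clip per instruction, feeding each clip back as context."""
    clips: List[Clip] = []
    previous: Optional[Clip] = None  # no preceding clip for the first step
    for text in instructions:
        clip = generate_clip(cond, text, previous)
        clips.append(clip)
        previous = clip              # condition the next step on this clip
    return clips


def toy_generator(cond: Conditions, text: str,
                  previous: Optional[Clip]) -> Clip:
    """Stub model: continues frame numbering from the preceding clip."""
    start = 0 if previous is None else previous[-1][0] + 1
    return [(start + i, text) for i in range(3)]


cond = Conditions(scene="scene-placeholder", character_views=["front", "back"])
clips = rollout(toy_generator, cond, ["walk forward", "play the guitar"])
```

In this sketch the second clip continues where the first ended because the stub reads the last frame of the preceding clip. In a real system, the equivalent role would be played by conditioning the generator on tokens from the preceding clip.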
