Animate Any Character in Any World

Yitong Wang, Fangyun Wei, Hongyang Zhang, Bo Dai, Yan Lu

TL;DR

AniX addresses the gap between static world models and controllable agents by enabling a user-provided 3DGS scene and character to be controlled through natural language in open-ended ways. It formulates video generation as a conditional autoregressive process with multi-modal conditioning on scene, character views, and text, and fine-tunes a pre-trained video generator using a GTA-V-based dataset with LoRA, augmented by preceding-token conditioning for long-horizon coherence. The approach achieves high visual fidelity, strong character consistency, broad action controllability including novel actions, and robust long-horizon generation, outperforming both foundation models and dedicated world models on WorldScore metrics. It also demonstrates efficient inference through distillation and supports scene and character customization, with real-world data further enhancing realism.

Abstract

Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, leveraging the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors from basic locomotion to object-centric interactions while freely exploring the environment. AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character, formulated as a conditional autoregressive video generation problem. Built upon a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.

Paper Structure

This paper contains 15 sections, 2 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: AniX lets users provide a 3DGS scene along with a 3D or multi-view character, then interactively control the character's behaviors and actively explore the environment through natural language commands. The system features: (1) Consistent Environment and Character Fidelity, ensuring visual and spatial coherence with the user-provided scene and character; (2) a Rich Action Repertoire covering a wide range of behaviors, including locomotion, gestures, and object-centric interactions; (3) Long-Horizon, Temporally Coherent Interaction, enabling iterative user interaction while maintaining continuity across generated clips; and (4) Controllable Camera Behavior, which explicitly incorporates camera control (analogous to navigating 3DGS views) to produce accurate, user-specified viewpoints.
  • Figure 2: (a) Each training sample consists of a 3D character and a video depicting the character performing an action described by a short text. Through segmentation and inpainting, we obtain the corresponding scene video and character mask sequence. The VAE encoder is then applied to encode these inputs into tokens. (b) AniX predicts target video tokens conditioned on scene, mask, text, and multi-view character tokens within a Multi-Modal Diffusion Transformer, trained using Flow Matching (Lipman et al., 2022). Refer to Figure 3 for the training process of the auto-regressive mode, which enables iterative interaction with AniX, and Figure 4 for the inference.
  • Figure 3: Illustration of the auto-regressive mode. The only difference from the original architecture in Figure 2 is the addition of an extra conditioning input: the preceding video tokens. Note that a mismatch exists between training and inference: during training, the preceding video tokens are derived from ground-truth videos, whereas during inference, they are generated by the model itself. To mitigate this discrepancy, we add small Gaussian noise to the preceding video tokens during training and refer to the resulting tokens as augmented preceding video tokens.
  • Figure 4: Inference of AniX. (a) Users first specify the inputs, including the character, 3DGS scene, virtual camera location, and character anchor. (b) The user-provided text instruction is parsed, and a corresponding camera path is generated. Applying this path to the 3DGS scene produces a rendered scene video. (c) AniX then takes multiple inputs as conditions to generate the final output. Steps (b) and (c) can be performed iteratively, enabling temporally consistent, long-horizon interactions.
  • Figure 5: Screenshot visualizations of videos generated by AniX, showcasing different characters performing various actions across two scenes. Additional examples are provided in the appendix.
  • ...and 10 more figures
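
The noise augmentation described in the Figure 3 caption, and the clip-by-clip loop implied by Figure 4, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the nested-list token layout, and the value `noise_std=0.05` are all assumptions made for illustration (the caption only says the noise is "small"), and `generate_clip` is a hypothetical stand-in for the actual model.

```python
import random


def augment_preceding_tokens(tokens, noise_std=0.05, seed=None):
    """Return a noisy copy of the preceding video tokens.

    `tokens` is a list of token vectors (lists of floats). During training
    these would come from ground-truth video; adding zero-mean Gaussian
    noise makes them look more like the imperfect, model-generated tokens
    seen at inference. `noise_std` is an assumed value, not from the paper.
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, noise_std) for x in row] for row in tokens]


def autoregressive_rollout(generate_clip, conditions, num_clips):
    """Generate clips one at a time, feeding each clip back as context.

    `generate_clip(conditions, preceding)` is a hypothetical callable that
    returns the tokens of the next clip; `preceding` is None for the first
    clip and the previously generated clip afterwards.
    """
    clips = []
    preceding = None
    for _ in range(num_clips):
        clip = generate_clip(conditions, preceding)
        clips.append(clip)
        preceding = clip  # inference uses the model's own output as context
    return clips


# Toy usage: 64 token vectors of dimension 8, all zeros.
ground_truth_tokens = [[0.0] * 8 for _ in range(64)]
augmented_tokens = augment_preceding_tokens(ground_truth_tokens, seed=0)
```

The noise is applied only in the training-time helper; the rollout loop passes generated tokens through unchanged, which mirrors the training/inference mismatch the caption describes.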