
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

Tencent HY Team

Abstract

Video generation models internalize physical realism as their prior. Anime deliberately violates physics with smears, impact frames, and chibi shifts, and its thousands of coexisting artistic conventions yield no single "physics of anime" for a model to absorb. Physics-biased models therefore either flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels and verbalizes them as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures that categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of the five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4%) and Artistic Motion (+0.55, +16.9%). We will publicly release the AniMatrix model weights and inference code.
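To make the dual-path injection concrete, the following is a minimal PyTorch sketch of one conditioning step inside a DiT block. Module names, dimensions, and the exact AdaLN parameterization are assumptions for illustration; the paper's released code may differ, and the ordering of the two pathways within a block is illustrative.

```python
import torch
import torch.nn as nn

class DualPathConditioning(nn.Module):
    """Sketch of the Creator-Language dual-channel injection.

    Assumptions (not the paper's released code): hidden sizes, module
    names, and the AdaLN parameterization below are illustrative only.
    """

    def __init__(self, dim: int = 1024, n_heads: int = 16):
        super().__init__()
        # Path 1: cross-attention over the concatenated [tag ; T5] sequence.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Path 2: AdaLN modulation driven by the global tag CLS vector.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.adaln = nn.Linear(dim, 2 * dim)  # predicts scale and shift

    def forward(self, x, tag_tokens, text_tokens, tag_cls):
        # x:           (B, N, dim)     video latent tokens in one DiT block
        # tag_tokens:  (B, T_tag, dim) trainable tag-encoder outputs
        # text_tokens: (B, T_txt, dim) frozen T5 outputs, projected to dim
        # tag_cls:     (B, dim)        global tag summary vector
        cond = torch.cat([tag_tokens, text_tokens], dim=1)
        # Path 1: fine-grained spatial/temporal control via cross-attention.
        x = x + self.cross_attn(x, cond, cond, need_weights=False)[0]
        # Path 2: the tag CLS vector modulates the normalized activations,
        # so categorical directives re-enter at every layer instead of
        # competing with the open-ended text tokens for attention.
        scale, shift = self.adaln(tag_cls).chunk(2, dim=-1)
        x = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x
```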

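The third step, deformation-aware preference optimization, can be sketched in the standard DPO form. The reward-margin weighting below is an assumption about how a domain-specific reward model's scores might be used to separate intentional artistry from pathological collapse; the paper's exact objective is not reproduced here.

```python
import torch.nn.functional as F

def deformation_aware_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                               reward_margin, beta: float = 0.1):
    """Hedged sketch: standard DPO loss over preference pairs chosen by a
    domain-specific reward model that scores whether a deformation is
    intentional artistry (preferred, *_w) or pathological collapse
    (rejected, *_l). The reward_margin weighting is an assumption.

    All arguments are per-pair log-probability tensors of shape (B,),
    except reward_margin, a (B,) tensor of reward-model score gaps.
    """
    # Implicit reward difference between preferred and rejected samples,
    # measured relative to the frozen reference policy.
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Assumption: pairs with a larger reward-model margin give a cleaner
    # art-vs-failure signal, so they are weighted more heavily.
    return -(reward_margin * F.logsigmoid(logits)).mean()
```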


Figures (7)

  • Figure 1: The Industrial Production Taxonomy $\mathcal{T}=\mathcal{S}\times\mathcal{M}\times\mathcal{C}\times\mathcal{V}$. Every clip is mapped to a coordinate in this four-axis production-variable space---Style (rendering paradigm and motion dialect), Motion (performance semantics and kinetic intensity), Camera (cinematographic framing and choreography), and VFX (anime-specific symbolic and technical effects)---forming a structured, navigable control space that the model cannot self-discover from raw pixels (an illustrative coordinate is sketched after this list). Appendix \ref{app:taxonomy} details the full axis definitions and vocabularies; Sec. \ref{sec:data:caption} shows how AniCaption infers these coordinates from clips and verbalizes them as directorial directives.
  • Figure 2: Overview of the Creator-Language Dual-Channel Conditioning architecture. Production tags are encoded by a trainable Tag Transformer via field--value decomposition, while free-form directives pass through a frozen umT5-XXL encoder. The two representations are injected into the MoE DiT through complementary pathways: concatenated sequences via cross-attention (Path 1) for fine-grained spatial/temporal control, and the global tag CLS vector via AdaLN modulation (Path 2) for enforcing overarching production attributes at every layer.
  • Figure 3: Qualitative comparison on two prompts at opposite extremes of the artistic-control spectrum (rows: AniMatrix, Wan2.2, Seedance-Pro 1.0; columns: temporally ordered samples). Example 1 (top, sakuga). A character lunges forward in a low stance, trailed by straight energy beams across the night sky. AniMatrix renders the lunge with crisp straight beams; Wan2.2 collapses the beams into deformed smears with motion blur; Seedance-Pro 1.0 emits no VFX and loses the actor at $t{\approx}1.5$ s, so its columns are sampled within the window where the actor is still on screen. Example 2 (bottom, group formation with magic shielding). Several characters in ancient costume gather into two rows inside a burning building while a blue magic shield condenses and expands, fireballs erupt from the windows, and the camera pushes forward. AniMatrix keeps the group choreography, shield expansion, and fireball timing aligned with the prompt; Wan2.2 forms a looser crowd and localizes the shield; Seedance-Pro 1.0 enlarges the shield but loses formation precision and fireball timing. The baselines fail in distinct artistic-correctness modes (VFX deformation or absence, loose formation, shield localization, timing drift) that all trace back to a physics-biased prior.
  • Figure 4: Compact excerpt of a structured caption highlighting three distinguishing design choices: (i) the temporally ordered motion array uses cross-references such as <subject_0> to link actions to specific subjects; (ii) the AnimeVisualEffects field carries the three-level VFX hierarchy (type/sub_type/sub_sub_type); (iii) global style and camera tags are kept separate from per-entity annotations (see the caption sketch after this list).
  • Figure 5: Full structured caption for a single clip, expanded from the excerpt in Fig. \ref{fig:caption_json}. Fields are organized into six groups: subjects (entity identity and position), motion (temporal action and expression), AnimeVisualEffects (hierarchical VFX annotations), global style tags, camera metadata, and environment (scene description). Cross-references such as <subject_0> link motion descriptions to specific subjects.
  • ...and 2 more figures
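The taxonomy coordinate of Fig. 1 and the structured caption of Figs. 4 and 5 can be illustrated together. Below is a minimal Python sketch of what one clip's annotation might look like; every field name and value is an assumption reconstructed from the figure captions above, not the released AniCaption schema.

```python
# Illustrative sketch of one clip's production-variable coordinate and
# structured caption, inferred from the captions of Figs. 1, 4, and 5.
# All field names and values are assumptions, not the released schema.
clip_annotation = {
    # Coordinate in the four-axis taxonomy T = S x M x C x V (Fig. 1).
    "style":  {"rendering": "cel_shaded", "motion_dialect": "sakuga"},
    "motion": {"performance": "combat_lunge", "kinetic_intensity": "high"},
    "camera": {"framing": "low_angle", "choreography": "push_in"},
    # Three-level VFX hierarchy: type/sub_type/sub_sub_type (Fig. 4, ii).
    "AnimeVisualEffects": {
        "type": "energy",
        "sub_type": "beam",
        "sub_sub_type": "straight_trail",
    },
    # Per-entity annotations, kept apart from global tags (Fig. 4, iii).
    "subjects": [
        {"id": "<subject_0>", "identity": "swordsman in ancient costume",
         "position": "foreground left"},
    ],
    # Temporally ordered motion array with cross-references (Fig. 4, i).
    "motion_sequence": [
        "<subject_0> drops into a low stance",
        "<subject_0> lunges forward, trailing straight energy beams",
    ],
    "environment": "night sky over a city street",
}
```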