AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment

Yuanfeng Xu, Yuhao Chen, Zhongzhan Huang, Zijian He, Guangrun Wang, Philip Torr, Liang Lin

TL;DR

AnimateZoo tackles cross-species video animation, where pose misalignment between species breaks pose-driven methods, with a zero-shot, diffusion-based framework trained on two animal video datasets spanning a wide variety of species. The method combines three components within a ControlNet-inspired architecture that includes temporal layers and a two-stage training schedule: a Laplacian detail booster for texture, a prompt-tuned, domain-specific identity extractor for appearance, and a scale-information remover that prevents shape leakage. Quantitative evaluations and user studies show improved fidelity and temporal coherence over prior cross-species methods, with actions inherited across diverse species while background and identity are preserved. The work also contributes two high-quality animal video datasets and points to practical uses of cross-species animation in entertainment and research.

Abstract

Recent video editing advancements rely on accurate pose sequences to animate subjects. However, these approaches are not suitable for cross-species animation because poses are misaligned between species (for example, the poses of a cat differ greatly from those of a pig due to differences in body structure). In this paper, we present AnimateZoo, a zero-shot diffusion-based video generator that addresses this challenging cross-species animation problem, aiming to accurately produce animal animations while preserving the background. The key technique used in AnimateZoo is subject alignment, which consists of two steps. First, we improve appearance feature extraction by integrating a Laplacian detail booster and a prompt-tuning identity extractor. These components are specifically designed to capture essential appearance information, including identity and fine details. Second, we align shape features and address conflicts from differing subjects by introducing a scale-information remover. This ensures accurate cross-species animation. Moreover, we introduce two high-quality animal video datasets featuring a wide variety of species. Trained on these extensive datasets, our model is capable of generating videos characterized by accurate movements, consistent appearance, and high-fidelity frames, without the pre-inference fine-tuning that prior work required. Extensive experiments showcase the outstanding performance of our method in cross-species action-following tasks, demonstrating exceptional shape adaptation capability. The project page is available at https://justinxu0.github.io/AnimateZoo/.

Paper Structure (34 sections, 12 figures, 4 tables)


Figures (12)

  • Figure 1: Faithful and controllable cross-species animation results of AnimateZoo. Without any parameter tuning, our model seamlessly inherits actions from diverse animal species while keeping scene and appearance information consistent.
  • Figure 2: Overview of AnimateZoo, which uses mismatched pose sequences to drive the motion of a reference subject within a scene. First, we remove the background from the reference image with a segmentation tool. Next, a scale-information remover strips shape constraints, avoiding conflicts with mismatched skeletal points. We then apply a Laplacian detail booster to enhance texture and edges in pixel space, and high-pass filters preserve pixel-level texture features, which are concatenated with the video scene to be edited. In parallel, the identity extractor, driven by prompt tuning, extracts appearance features from the enhanced subject. Finally, the pose sequence, appearance features, pixel-level texture information, and scene conditions are injected into a diffusion model equipped with temporal structures to produce a seamless synthesis result.
  • Figure 3: Effect of appropriately positioned skeletal points on motion control. In each group, the first row shows the video to be edited, with the reference image in the upper-right corner; the second row shows the synthesized video. (a) When nuanced beak movements must be depicted, adding extra annotation points enables finer control. (b) Conversely, omitting skeletal points at joints can mischaracterize the movement of the entire trunk or limb, producing blurry results. Labeling the abdomen and tail of birds separately yields better generation results.
  • Figure 4: Comparison of results with and without the Laplacian detail booster. (a) Without the Laplacian detail booster. (b) With the Laplacian detail booster. (c) Difference.
  • Figure 5: Visual comparison of AnimateZoo and other advanced methods across various tasks.
  • ...and 7 more figures
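
The captions for Figures 2 and 4 describe a Laplacian detail booster that enhances texture and edges in pixel space. The exact filter the authors use is not given here, so the following is only a minimal sketch of classic Laplacian sharpening, assuming a 4-neighbour kernel, a grayscale image with values in [0, 1], and a hypothetical `strength` parameter. The names `laplacian` and `boost_details` are illustrative and do not come from the paper.

```python
import numpy as np

# 4-neighbour discrete Laplacian kernel (an assumed, standard choice;
# the paper's actual filter is not specified in this summary).
LAPLACIAN_KERNEL = np.array([[0.0, 1.0, 0.0],
                             [1.0, -4.0, 1.0],
                             [0.0, 1.0, 0.0]])


def laplacian(image: np.ndarray) -> np.ndarray:
    """Apply the 4-neighbour Laplacian to a 2-D image, replicating border pixels."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    response = np.zeros((h, w), dtype=np.float64)
    # Accumulate the weighted, shifted copies of the image (3x3 correlation).
    for dy in range(3):
        for dx in range(3):
            response += LAPLACIAN_KERNEL[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return response


def boost_details(image: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Sharpen edges and texture by subtracting the scaled Laplacian response.

    `strength` is a hypothetical knob for this sketch, not a parameter from the paper.
    """
    image = image.astype(np.float64)
    boosted = image - strength * laplacian(image)
    return np.clip(boosted, 0.0, 1.0)


if __name__ == "__main__":
    # A vertical step edge: the boost darkens the dark side and brightens the bright side.
    step = np.full((4, 6), 0.4)
    step[:, 3:] = 0.6
    boosted = boost_details(step)
    difference = boosted - step  # analogous in spirit to panel (c) of Figure 4
    print(boosted[0])
    print(difference[0])
```

Flat regions pass through unchanged because their Laplacian response is zero, while pixels on either side of an edge are pushed apart, which is the texture-and-edge emphasis the captions describe.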