AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer

Liang An, Jin Lyu, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang

TL;DR

AniMer+ tackles cross-taxa animal mesh recovery by combining a high-capacity Vision Transformer with a Mixture-of-Experts design to unify pose and shape estimation for Mammalia and Aves. It introduces CtrlAni3D and CtrlAVES3D, large-scale diffusion-based synthetic datasets that provide 3D supervision and alleviate depth ambiguities, especially for birds. A family-aware contrastive loss and a two-stage training regimen further improve generalization, yielding better performance than prior methods on in-domain and out-of-domain benchmarks (Animal Kingdom, CUB, Cowbird). The work demonstrates strong cross-species generalization and offers a scalable blueprint for extending parametric animal models to broader taxa, with potential downstream applications in behavior analysis and avatar creation.

Abstract

In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to enable stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (Mammalia) and birds (Aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for Mammalia and Aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.
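
The split into taxa-shared and taxa-specific components described in the abstract can be illustrated with a toy block. This is a minimal sketch under stated assumptions, not the AniMer+ implementation: the class name `FamilyAwareBlock`, the layer sizes, the residual layout, and the hard routing by a known taxon label are all hypothetical choices made for illustration.

```python
import numpy as np

TAXA = ("mammalia", "aves")  # the two taxa named in the abstract


class FamilyAwareBlock:
    """Toy block: one taxa-shared projection plus one feed-forward expert per taxon.

    Hypothetical illustration only. Layer sizes, routing by a known taxon
    label, and the residual layout are assumptions, not the paper's design.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Taxa-shared component: applied to every input.
        self.shared = rng.normal(0.0, 0.1, size=(dim, dim))
        # Taxa-specific components: one two-layer MLP per taxon.
        self.experts = {
            taxon: (
                rng.normal(0.0, 0.1, size=(dim, hidden)),
                rng.normal(0.0, 0.1, size=(hidden, dim)),
            )
            for taxon in TAXA
        }

    def __call__(self, tokens, taxon):
        if taxon not in self.experts:
            raise ValueError(f"unknown taxon: {taxon!r}")
        # Shared path with a residual connection.
        h = tokens + tokens @ self.shared
        # Hard routing: only the expert for this taxon is evaluated.
        w_in, w_out = self.experts[taxon]
        return h + np.maximum(h @ w_in, 0.0) @ w_out


block = FamilyAwareBlock(dim=8, hidden=16)
tokens = np.ones((4, 8))  # 4 image tokens of width 8
out_mammal = block(tokens, "mammalia")
out_bird = block(tokens, "aves")
```

In this sketch both calls reuse the same shared weights, so the two outputs differ only through the taxon-specific expert; that is the sense in which common and distinct features can live in one model.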

Paper Structure

This paper contains 31 sections, 7 equations, 12 figures, 22 tables.

Figures (12)

  • Figure 1: Statistics and representative samples of the CtrlAVES3D and CtrlAni3D datasets. (a) and (b) show samples from the CtrlAVES3D dataset and the CtrlAni3D dataset, respectively. For each image pair, the left side displays the generated animal image, whose background comes from either COCO [lin2014microsoft] or AI synthesis, and the right side presents the rendered mesh label. Note that the synthesis process also generates truncated images. (c) shows the statistics of both datasets. More details about these two datasets can be found in Table \ref{tab:taxonomy}, Table \ref{tab:taxonomy_bird}, and Sec. \ref{sec:ctrlani3d}.
  • Figure 2: AniMer+ network architecture. AniMer+ consists of (1) a ViT-MoE encoder with N ViT-MoE blocks that extract quadrupedal or avian image features; (2) a Transformer decoder that processes features generated by the encoder; (3) a predictor head (MLPs) that generates animal family features for supervised contrastive learning; and (4) a regression head (MLPs) that estimates the parametric model parameters.
  • Figure 3: Dataset generation pipeline. The whole pipeline contains three parts: (a) Text prompt generation. (b) Condition image generation. (c) Image generation and post-processing.
  • Figure 4: CtrlAni3D and CtrlAVES3D failure cases and successful cases. (a) Failure cases. There are two main cases of failure: (1) At times, ControlNet may struggle to generate mesh-aligned poses (first row and second row). (2) Additionally, ControlNet may not effectively generate the intricate details of the animal body (third row and fourth row). (b) Successful cases. The backgrounds of the first and second rows are generated by ControlNet, while the backgrounds of the third and fourth rows are sourced from the COCO dataset.
  • Figure 5: Qualitative results of AniMer+ on the Animal Kingdom and CUB datasets. For each case, we display the input image and the output, which includes both a front-view rendering and a side-view rendering. The mammals and birds are sourced from the Animal Kingdom dataset and the CUB dataset, respectively.
  • ...and 7 more figures
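
The Figure 2 caption mentions a predictor head whose family features are trained with supervised contrastive learning. A generic supervised contrastive loss can be sketched as follows. It is a standard formulation written as an illustration, not the exact loss used in AniMer+: the function name, the temperature value, and the use of integer family ids as labels are assumptions.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch.

    features: (B, D) embeddings, e.g. from a hypothetical predictor head.
    labels:   (B,) integer family ids; samples sharing an id are positives.
    The temperature of 0.1 is an assumed default, not a value from the paper.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    batch = len(labels)
    not_self = ~np.eye(batch, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive partner in this batch

    # L2-normalize so dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    # Subtract the row maximum for numerical stability (cancels in log-softmax).
    logits = logits - logits.max(axis=1, keepdims=True)
    # Log-softmax over all other samples; the anchor itself is excluded.
    exp_logits = np.exp(logits) * not_self
    log_prob = logits - np.log(exp_logits.sum(axis=1, keepdims=True))
    # Mean log-probability of the positives, for anchors that have any.
    pos_log_prob = (log_prob * positives).sum(axis=1)[has_pos] / pos_count[has_pos]
    return float(-pos_log_prob.mean())


# Two tight clusters in 2D, labelled consistently and inconsistently.
feats = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]]
loss_matched = supervised_contrastive_loss(feats, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(feats, [0, 1, 0, 1])
```

When the labels agree with the clusters the loss is small, and when they cut across the clusters it is large, which is the pressure that pulls same-family embeddings together.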