AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer

Liang An; Jin Lyu; Li Lin; Pujin Cheng; Yebin Liu; Xiaoying Tang

TL;DR

AniMer+ tackles cross-taxa animal mesh recovery by combining a high-capacity Vision Transformer with a Mixture-of-Experts design to unify pose/shape estimation for Mammalia and Aves. It introduces CtrlAni3D and CtrlAVES3D, large-scale diffusion-based synthetic datasets that provide 3D supervision and alleviate depth ambiguities, especially for birds. A family-aware contrastive loss and a two-stage training regimen further boost generalization, enabling superior performance on in-domain and out-of-domain benchmarks (Animal Kingdom, CUB, Cow Bird) relative to prior methods. The work demonstrates strong cross-species generalization and offers a scalable blueprint for expanding parametric animal models to broader taxa, with potential downstream applications in behavior analysis and avatar creation.

Abstract

In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to empower stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (Mammalia) and birds (Aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for Mammalia and Aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.
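To make the split into taxa-shared and taxa-specific components concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the AniMer+ implementation: the class name `FamilyAwareLayer`, the helper functions, the hard routing by taxon label, and the way the two branches are summed are all hypothetical, since the abstract does not specify the gating rule or the layer layout.

```python
import random

# Hypothetical taxon labels; the abstract only names the two classes.
TAXA = ("mammalia", "aves")


def make_linear(dim_in, dim_out, rng):
    """Return a small random weight matrix as a list of rows (dim_out x dim_in)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]


def apply_linear(weights, x):
    """Matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class FamilyAwareLayer:
    """One layer with a shared branch plus one expert branch per taxon."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Taxa-shared weights: used for every input.
        self.shared = make_linear(dim, dim, rng)
        # Taxa-specific weights: one expert per taxon.
        self.experts = {taxon: make_linear(dim, dim, rng) for taxon in TAXA}

    def forward(self, token, taxon):
        # Hard routing by taxon label is an assumption of this sketch,
        # not the gating rule stated in the paper.
        shared_out = apply_linear(self.shared, token)
        expert_out = apply_linear(self.experts[taxon], token)
        # Residual connection around the sum of both branches.
        return [t + s + e for t, s, e in zip(token, shared_out, expert_out)]


layer = FamilyAwareLayer(dim=4)
token = [1.0, 0.5, -0.5, 2.0]
out_mammal = layer.forward(token, "mammalia")
out_bird = layer.forward(token, "aves")
```

In this sketch the same token produces different outputs for the two taxa because only the expert branch changes, while the shared branch contributes identically to both. That mirrors the stated goal of learning common and distinct anatomical features in one model.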

Paper Structure

Table of Contents

Figures (12)