AP-CAP: Advancing High-Quality Data Synthesis for Animal Pose Estimation via a Controllable Image Generation Pipeline
Lei Wang, Yujie Zhong, Xiaopeng Sun, Jingchun Cheng, Chengjian Feng, Qiong Cao, Lin Ma, Zhaoxin Fan
TL;DR
AP-CAP addresses the scarcity of high-quality data for 2D animal pose estimation by introducing a diffusion-based controllable image generation pipeline. It combines a Multi-Modal Animal Image Generation Model with three data synthesis strategies (MF-AISS, PA-AISS, and CE-AISS, based on modality fusion, pose adjustment, and caption enhancement, respectively) to generate pose-annotated images and build MPCH, the first large-scale hybrid synthetic-real dataset for animal pose estimation. The diffusion-based generator is trained end-to-end, and the data it synthesizes is used to improve both in-domain performance and cross-domain generalization across diverse species. The approach achieves consistent gains across multiple backbones and datasets, and the MPCH benchmark provides a new, scalable resource for evaluating animal pose estimation in cross-domain settings.
Abstract
The task of 2D animal pose estimation plays a crucial role in advancing deep learning applications in animal behavior analysis and ecological research. Despite notable progress in some existing approaches, our study reveals that the scarcity of high-quality datasets remains a significant bottleneck, limiting the full potential of current methods. To address this challenge, we propose a novel Controllable Image Generation Pipeline for synthesizing animal pose estimation data, termed AP-CAP. Within this pipeline, we introduce a Multi-Modal Animal Image Generation Model capable of producing images with expected poses. To enhance the quality and diversity of the generated data, we further propose three innovative strategies: (1) Modality-Fusion-Based Animal Image Synthesis Strategy to integrate multi-source appearance representations, (2) Pose-Adjustment-Based Animal Image Synthesis Strategy to dynamically capture diverse pose variations, and (3) Caption-Enhancement-Based Animal Image Synthesis Strategy to enrich visual semantic understanding. Leveraging the proposed model and strategies, we create the MPCH Dataset (Modality-Pose-Caption Hybrid), the first hybrid dataset that innovatively combines synthetic and real data, establishing the largest-scale multi-source heterogeneous benchmark repository for animal pose estimation to date. Extensive experiments demonstrate the superiority of our method in improving both the performance and generalization capability of animal pose estimators.
