Probabilistic Prompt Distribution Learning for Animal Pose Estimation
Jiyong Rao, Brian Nlong Zhao, Yu Wang
TL;DR
This work tackles multi-species animal pose estimation under strong visual variability and long-tail data distributions. It proposes PPAP, a probabilistic prompt distribution learning framework built on CLIP-based vision-language models. Each keypoint is described by multiple attribute prompts modeled as Gaussian distributions, with means from a text decoder and variances from a visual-text decoder; samples are drawn via the reparameterization trick and regularized with a diversity term and a KL prior. The method employs three cross-modal fusion strategies (heuristic, ensemble, attention) to align textual prompts with spatial visual features, and optimizes a total loss $L_{total} = L_{pred} + L_{spatial} + L_{feature} + L_{prompt}$. Empirically, PPAP achieves state-of-the-art results on AP-10K and AnimalKingdom in both supervised and zero-shot settings, demonstrating strong cross-species generalization and robustness to data imbalance. The approach provides a practical, plug-and-play mechanism for leveraging textual priors in complex, cross-domain pose estimation tasks.
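The Gaussian prompt sampling described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, tensor shapes, and the choice of a standard normal prior for the KL term are assumptions made for illustration, and the random `mu` and `log_var` arrays stand in for the outputs of the text decoder and visual-text decoder.

```python
import numpy as np

def sample_prompt_embeddings(mu, log_var, num_samples, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness lives in eps, so z stays a deterministic function of
    mu and log_var (which is what makes it differentiable in a deep
    learning framework).
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal((num_samples,) + mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over the embedding axis.

    Assumption: the prior is a standard normal; the source only says "KL prior".
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
num_keypoints, num_prompts, dim = 17, 4, 8  # illustrative sizes only

# Placeholders for decoder outputs (hypothetical values).
mu = rng.standard_normal((num_keypoints, num_prompts, dim))
log_var = np.full_like(mu, -2.0)

samples = sample_prompt_embeddings(mu, log_var, num_samples=5, rng=rng)
kl = kl_to_standard_normal(mu, log_var)  # one value per (keypoint, prompt)
```

Parameterizing the variance through `log_var` keeps `sigma` positive without an explicit constraint, which is a common design choice for this kind of model.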
Abstract
Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper addresses the problem through efficient prompt learning for Vision-Language Pretrained (VLP) models, \textit{e.g.}, CLIP, aiming to resolve the cross-species generalization problem. At the core of the solution lie prompt design, probabilistic prompt modeling, and cross-modal adaptation, which enable prompts to supply complementary cross-modal information and effectively overcome large data variance under an imbalanced data distribution. To this end, we propose a novel probabilistic prompting approach that fully exploits textual descriptions, which can alleviate the diversity issues caused by the long-tail property and increase the adaptability of prompts to instances of unseen categories. Specifically, we first introduce a set of learnable prompts and propose a diversity loss to maintain distinctiveness among prompts, so that they represent diverse image attributes. Diverse probabilistic textual representations are then sampled and used as guidance for pose estimation. Subsequently, we explore three different cross-modal fusion strategies at the spatial level to alleviate the adverse impact of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves state-of-the-art performance under both supervised and zero-shot settings. The code is available at https://github.com/Raojiyong/PPAP.
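To make the idea of a diversity loss concrete, the sketch below penalizes pairwise similarity among prompt embeddings. The abstract does not specify the form of the loss, so the mean absolute off-diagonal cosine similarity used here is an assumption, as are the function name and array shapes.

```python
import numpy as np

def diversity_loss(prompts, eps=1e-8):
    """Mean absolute cosine similarity between distinct prompt embeddings.

    prompts: array of shape (num_prompts, dim), with num_prompts >= 2.
    Assumed form, not taken from the paper: lower values mean the prompts
    point in more distinct directions.
    """
    norms = np.linalg.norm(prompts, axis=-1, keepdims=True)
    normed = prompts / (norms + eps)
    sim = normed @ normed.T                       # pairwise cosine similarity
    off_diag = sim[~np.eye(sim.shape[0], dtype=bool)]  # drop self-similarity
    return float(np.mean(np.abs(off_diag)))

orthogonal = np.eye(4, 8)    # four mutually orthogonal prompts
identical = np.ones((4, 8))  # four copies of the same prompt

loss_orthogonal = diversity_loss(orthogonal)  # close to 0
loss_identical = diversity_loss(identical)    # close to 1
```

Minimizing a term of this kind pushes the learnable prompts apart, which matches the stated goal of keeping them distinct so that each can capture a different image attribute.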
