Probabilistic Prompt Distribution Learning for Animal Pose Estimation

Jiyong Rao, Brian Nlong Zhao, Yu Wang

TL;DR

This work tackles multi-species animal pose estimation under strong visual variability and long-tail data distributions. It proposes PPAP, a probabilistic prompt distribution learning framework built on CLIP-based vision-language models, where each keypoint is described by multiple attribute prompts modeled as Gaussian distributions with means from a text decoder and variances from a visual-text decoder; samples are drawn via the reparameterization trick and regularized with a diversity term and KL prior. The method employs three cross-modal fusion strategies (heuristic, ensemble, attention) to align textual prompts with spatial visual features, and optimizes a total loss $L_{total}=L_{pred}+L_{spatial}+ L_{feature}+ L_{prompt}$. Empirically, PPAP achieves state-of-the-art results on AP-10K and AnimalKingdom in both supervised and zero-shot settings, demonstrating strong cross-species generalization and robustness to data imbalance. The approach provides a practical, plug-and-play mechanism to leverage textual priors for complex, cross-domain pose estimation tasks.
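As a minimal sketch of the sampling step described above, the snippet below draws keypoint prompt embeddings from diagonal Gaussians with the reparameterization trick and computes a KL term. It assumes a standard normal prior and a log-variance parameterization; neither is specified in this summary, and the function names and array sizes are purely illustrative, not taken from the PPAP code.

```python
import numpy as np

def sample_prompt(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over the embedding dimension.

    The N(0, I) prior is an assumption made for this sketch.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
num_keypoints, dim = 17, 8                      # illustrative sizes only
mu = rng.standard_normal((num_keypoints, dim))  # stand-in for predicted means
log_var = np.full((num_keypoints, dim), -2.0)   # stand-in for predicted log-variances

z = sample_prompt(mu, log_var, rng)             # one sampled embedding per keypoint
kl = kl_to_standard_normal(mu, log_var)         # one KL value per keypoint
```

Because the noise `eps` is drawn independently of `mu` and `log_var`, gradients can flow through both parameters when the same computation is written in an autodiff framework.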

Abstract

Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper addresses the problem through efficient prompt learning for Vision-Language Pretrained (VLP) models, e.g., CLIP, aiming to resolve the cross-species generalization problem. The core of the solution lies in prompt design, probabilistic prompt modeling, and cross-modal adaptation, which enable prompts to compensate for cross-modal information and effectively overcome large data variance under an unbalanced data distribution. To this end, we propose a novel probabilistic prompting approach that fully exploits textual descriptions, which could alleviate the diversity issues caused by the long-tail property and increase the adaptability of prompts to instances of unseen categories. Specifically, we first introduce a set of learnable prompts and propose a diversity loss to maintain distinctiveness among prompts, so that they represent diverse image attributes. Diverse probabilistic textual representations are sampled and used as guidance for pose estimation. Subsequently, we explore three different cross-modal fusion strategies at the spatial level to alleviate the adverse impact of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves state-of-the-art performance under both supervised and zero-shot settings. The code is available at https://github.com/Raojiyong/PPAP.
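The abstract does not give the form of the diversity loss. One common way to keep a set of prompt embeddings distinct is to penalize their pairwise cosine similarity, sketched below. This is an assumed stand-in for illustration, not the paper's definition, and `diversity_loss` is a hypothetical name.

```python
import numpy as np

def diversity_loss(prompts):
    """Mean absolute off-diagonal cosine similarity among prompt embeddings.

    prompts: array of shape (num_prompts, dim). The loss is 0 when all
    prompts are mutually orthogonal and 1 when they all point the same way.
    """
    normed = prompts / np.linalg.norm(prompts, axis=-1, keepdims=True)
    sim = normed @ normed.T                      # (num_prompts, num_prompts)
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]       # drop self-similarities
    return float(np.mean(np.abs(off_diag)))

orthogonal = np.eye(4)                           # four mutually orthogonal prompts
collapsed = np.ones((3, 5))                      # three identical prompts
loss_orthogonal = diversity_loss(orthogonal)
loss_collapsed = diversity_loss(collapsed)
```

Minimizing such a term pushes the prompts apart, which matches the stated goal of having each prompt capture a different attribute.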

Paper Structure

This paper contains 18 sections, 13 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Motivation: a single keypoint can be described in complementary ways from different views (illustrated with the left shoulder).
  • Figure 2: Overall framework of PPAP. First, we create $N_p$ keypoint attribute templates and generate distinctive embeddings for these attributes using a text encoder. Each keypoint prompt embedding is represented probabilistically as a multivariate Gaussian distribution, with its mean derived from the text decoder and its variance from the visual-text decoder. We then sample keypoint prompt representations from this distribution and perform cross-modal fusion during the spatial adaptation stage to capture the spatial relationship between textual and visual content.
  • Figure 3: Qualitative visualization of several examples with CLAMP and our proposed PPAP.
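The cross-modal fusion step in the Figure 2 caption can be pictured with a small sketch. The snippet below shows one plausible attention-style variant, in which each sampled keypoint prompt scores every spatial location of a visual feature map. The scaled dot product and softmax, the shapes, and the name `attention_fusion` are assumptions for illustration; the page does not describe how the three PPAP fusion strategies are actually computed.

```python
import numpy as np

def attention_fusion(prompts, feats):
    """Score each spatial location against each keypoint prompt.

    prompts: (K, D) sampled keypoint prompt embeddings.
    feats:   (H, W, D) visual feature map.
    Returns (K, H, W) maps; each map is a softmax over all H*W locations.
    """
    h, w, d = feats.shape
    flat = feats.reshape(h * w, d)
    logits = prompts @ flat.T / np.sqrt(d)         # (K, H*W) scaled dot products
    logits -= logits.max(axis=-1, keepdims=True)   # subtract max for stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights.reshape(-1, h, w)

rng = np.random.default_rng(0)
prompts = rng.standard_normal((3, 8))              # 3 keypoints, 8-dim embeddings
feats = rng.standard_normal((4, 5, 8))             # 4x5 feature map
maps = attention_fusion(prompts, feats)            # shape (3, 4, 5)
```

Each resulting map can be read as a coarse, text-guided estimate of where a keypoint is likely to lie.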