SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation

Shuang Liang; Jing He; Chuanmeizhi Wang; Lejun Liao; Guo Zhang; Yingcong Chen; Yuan Yuan

TL;DR

SDPose is a diffusion-prior-based fine-tuning framework for human pose estimation that operates in the latent space of the Stable Diffusion U-Net, using a lightweight heatmap decoder and an auxiliary RGB reconstruction branch to improve cross-domain robustness. By exploiting multi-scale latent features and a deterministic x0-prediction setup, SDPose matches Sapiens-1B/2B on the COCO validation set with one-fifth of the Sapiens training schedule and sets new state-of-the-art results under domain shift on HumanArt and COCO-OOD. The authors also introduce COCO-OOD, a style-transferred variant of COCO with preserved annotations, to benchmark style-induced generalization. Ablations and latent-space analyses indicate that diffusion priors encode domain-invariant structure, and SDPose additionally serves as a zero-shot pose annotator for pose-guided image and video generation. Overall, SDPose shows that generative priors can yield efficient, robust pose estimation, with practical relevance for animation, robotics, and controllable generation.

Abstract

Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold and Lotus adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs remains underexplored. In this paper, we propose SDPose, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net's image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct COCO-OOD, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Extensive ablations highlight the importance of diffusion priors, RGB reconstruction, and multi-scale SD U-Net features for cross-domain generalization, and t-SNE analyses further explain SD's domain-invariant latent structure. We also show that SDPose serves as an effective zero-shot pose annotator for controllable image and video generation.
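
To make the heatmap-based formulation concrete, the following is a minimal, dependency-free sketch of two steps the abstract implies: reading a keypoint location off a predicted heatmap, and combining the pose loss with the auxiliary RGB reconstruction loss. It is an illustration under stated assumptions, not the authors' implementation. The function names, the heatmap `stride` of 4, the mean-squared-error form of both terms, and the `recon_weight` value are all hypothetical; the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

Heatmap = List[List[float]]  # one keypoint's H x W score grid


def decode_keypoint(heatmap: Heatmap, stride: int = 4) -> Tuple[int, int, float]:
    """Return (x, y, score) of the heatmap peak, rescaled to input pixels.

    `stride` (heatmap-to-image scale) is an assumed value, not from the paper.
    """
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x * stride, best_y * stride, best_score


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over two equally sized flat sequences."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    heatmap_pred: Sequence[float],
    heatmap_gt: Sequence[float],
    rgb_pred: Sequence[float],
    rgb_gt: Sequence[float],
    recon_weight: float = 0.1,  # hypothetical weighting of the auxiliary branch
) -> float:
    """Pose (heatmap) loss plus a weighted auxiliary reconstruction loss.

    The additive MSE form is an assumption made for illustration only.
    """
    return mse(heatmap_pred, heatmap_gt) + recon_weight * mse(rgb_pred, rgb_gt)


if __name__ == "__main__":
    toy_heatmap = [
        [0.0, 0.1, 0.0],
        [0.2, 0.9, 0.3],
        [0.0, 0.1, 0.0],
    ]
    print(decode_keypoint(toy_heatmap))
```

In this sketch the reconstruction term acts only as a regularizer: setting `recon_weight` to 0 reduces the objective to the plain heatmap loss, which is the configuration the abstract's ablations would compare against.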
