SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation

Shuang Liang, Jing He, Chuanmeizhi Wang, Lejun Liao, Guo Zhang, Yingcong Chen, Yuan Yuan

TL;DR

SDPose is a fine-tuning framework for human pose estimation that builds on diffusion priors and operates in the Stable Diffusion U-Net latent space, using a lightweight heatmap decoder and an auxiliary RGB reconstruction branch to improve cross-domain robustness. By exploiting multi-scale latent features and a deterministic x0-prediction setup, it matches Sapiens-1B/2B on the COCO validation set with one-fifth of the Sapiens training schedule and sets new state-of-the-art results under domain shift on HumanArt and COCO-OOD. The authors also introduce COCO-OOD, a style-transferred variant of COCO with preserved annotations, to benchmark style-induced generalization. Ablations and t-SNE latent-space analyses indicate that diffusion priors encode domain-invariant structure, and SDPose additionally serves as a zero-shot pose annotator for pose-guided image and video generation. Overall, the work targets efficient, robust pose estimation built on generative priors, with practical relevance to animation, robotics, and controllable generation.

Abstract

Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold and Lotus adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs remains underexplored. In this paper, we propose SDPose, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net's image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct COCO-OOD, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Extensive ablations highlight the importance of diffusion priors, RGB reconstruction, and multi-scale SD U-Net features for cross-domain generalization, and t-SNE analyses further explain SD's domain-invariant latent structure. We also show that SDPose serves as an effective zero-shot pose annotator for controllable image and video generation.
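
As a small illustration of the heatmap representation the abstract refers to, the sketch below reads one keypoint location off each heatmap by taking its highest-valued cell. This is a generic argmax decoding and an assumption made for illustration: the abstract does not say how SDPose converts heatmaps to coordinates, and the names `decode_keypoint` and `decode_pose` are hypothetical.

```python
from typing import List, Tuple

# One heatmap per keypoint, stored as rows (y) by columns (x).
Heatmap = List[List[float]]


def decode_keypoint(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (x, y, score) of the highest-valued cell in one heatmap.

    Assumption: plain argmax decoding; the source does not specify
    the decoding rule SDPose actually uses.
    """
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_score:
                best_x, best_y, best_score = x, y, value
    return best_x, best_y, best_score


def decode_pose(heatmaps: List[Heatmap]) -> List[Tuple[int, int, float]]:
    """Decode every keypoint heatmap independently."""
    return [decode_keypoint(h) for h in heatmaps]
```

Each heatmap yields one `(x, y, score)` triple in heatmap coordinates; mapping those back to image pixels would depend on the heatmap resolution, which is not given here.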

Paper Structure

This paper contains 37 sections, 7 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: SDPose: OOD-robust pose via diffusion priors. On stylized paintings, SDPose surpasses Sapiens and ViTPose++-H, matching SoTA on COCO and setting new records on HumanArt and COCO-OOD; yellow boxes show baseline failures.
  • Figure 2: Training Pipeline of SDPose. The input RGB image is first encoded into the latent space by a pre-trained VAE. The U-Net is conditioned for multi-task learning via a class embedding. When the class label is set to [0,1], the U-Net predicts the reconstructed RGB latent; when set to [1,0], it produces features for heatmap prediction. The output layer of the U-Net is task-specific: the original convolutional output layer is retained for RGB latent reconstruction, while a lightweight heatmap decoder is used to process the U-Net’s intermediate features for keypoint heatmap prediction.
  • Figure 3: SDPose Inference Pipeline.
  • Figure 4: Qualitative results on real-world photographs. The yellow boxes highlight regions where baselines fail to predict accurate poses.
  • Figure 5: Illustration of the COCO-OOD dataset.
  • ...and 5 more figures
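
The task switching described in the Figure 2 caption can be sketched as a lookup from class label to output head. This is a simplified illustration, not the authors' implementation: the two head functions are hypothetical stand-ins operating on a flat list of numbers, and the sketch covers only the choice of output head, not how the class embedding conditions the U-Net itself. Only the label values [0,1] and [1,0] come from the caption.

```python
from typing import Callable, Dict, List, Tuple

# Task labels as stated in the Figure 2 caption.
RGB_TASK: Tuple[int, int] = (0, 1)      # U-Net predicts the reconstructed RGB latent
HEATMAP_TASK: Tuple[int, int] = (1, 0)  # U-Net produces features for heatmap prediction

Features = List[float]


def original_output_layer(features: Features) -> Features:
    """Hypothetical stand-in for the retained convolutional output layer."""
    return [2.0 * f for f in features]


def heatmap_decoder(features: Features) -> Features:
    """Hypothetical stand-in for the lightweight heatmap decoder."""
    return [f + 1.0 for f in features]


# Each class label selects exactly one task-specific head.
HEADS: Dict[Tuple[int, int], Callable[[Features], Features]] = {
    RGB_TASK: original_output_layer,
    HEATMAP_TASK: heatmap_decoder,
}


def route(features: Features, class_label: Tuple[int, int]) -> Features:
    """Send backbone features to the head selected by the class label."""
    if class_label not in HEADS:
        raise ValueError(f"unknown class label: {class_label}")
    return HEADS[class_label](features)
```

Calling `route(features, HEATMAP_TASK)` applies the heatmap stand-in, while `route(features, RGB_TASK)` applies the reconstruction stand-in; any other label raises `ValueError`.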