DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Yixuan Zhu, Ao Li, Yansong Tang, Wenliang Zhao, Jie Zhou, Jiwen Lu

TL;DR

Occluded 3D human mesh recovery is challenging due to weak image features under occlusion. This paper introduces DPMesh, which leverages a pre-trained diffusion model as a one-step image backbone with conditional control from 2D cues and a Noisy Key-point Reasoning module to exploit diffusion priors for occluded pose estimation. The method regresses SMPL parameters through a VQVAE-based pose representation guided by cross-attention maps and diffusion priors, without iterative denoising. Across occlusion and standard benchmarks, DPMesh achieves state-of-the-art performance, especially in crowded or heavily occluded scenes, highlighting the practical value of diffusion priors for perception tasks.

Abstract

The recovery of occluded human meshes presents challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper, we introduce DPMesh, an innovative framework for occluded human mesh recovery that capitalizes on the profound diffusion prior about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction, DPMesh seamlessly integrates the pre-trained denoising U-Net with potent knowledge as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses, DPMesh incorporates well-designed guidance via condition injection, which produces effective controls from 2D observations for the denoising U-Net. Furthermore, we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior, thereby enhancing accuracy. Extensive experiments affirm the efficacy of our framework, as we outperform state-of-the-art methods on both occlusion-specific and standard datasets. The persuasive results underscore its ability to achieve precise and robust 3D human mesh recovery, particularly in challenging scenarios involving occlusion and crowded scenes.
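
The difference between iterative diffusion-based estimation and the single-step backbone use described above can be illustrated with a toy sketch. Everything here is hypothetical and is not the authors' implementation: `CountingDenoiser` is a stand-in for the pre-trained denoising U-Net, and the `(input, timestep, condition)` call signature and the "halfway toward the condition" update are assumptions chosen only to keep the example runnable.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical denoiser signature: (noisy input, timestep, condition) -> output.
Denoiser = Callable[[Vector, int, Vector], Vector]


def iterative_refinement(denoiser: Denoiser, noise: Vector,
                         timesteps: Sequence[int], condition: Vector) -> Vector:
    """Diffusion-style estimation: one denoiser call per timestep."""
    state = list(noise)
    for t in timesteps:
        state = denoiser(state, t, condition)
    return state


def single_step_features(denoiser: Denoiser, image_latent: Vector,
                         timestep: int, condition: Vector) -> Vector:
    """Backbone-style use: exactly one denoiser call, output treated as features."""
    return denoiser(image_latent, timestep, condition)


class CountingDenoiser:
    """Toy stand-in for a U-Net that records how many times it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector, t: int, condition: Vector) -> Vector:
        self.calls += 1
        # Arbitrary toy update: pull each value halfway toward the condition.
        return [0.5 * (xi + ci) for xi, ci in zip(x, condition)]
```

In this sketch the single-step path costs one forward pass regardless of how many timesteps an iterative scheme would use; in the framework described here, the resulting features would then be passed to a regressor that predicts the SMPL parameters.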

Paper Structure (16 sections, 11 equations, 9 figures, 11 tables)

This paper contains 16 sections, 11 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Main idea of the proposed DPMesh framework. We design a framework that exploits the rich prior knowledge about human structure and spatial interaction in the pre-trained diffusion model for the challenging task of occluded human mesh recovery. By simply adapting the denoising U-Net as a single-step backbone with spatial conditions, we achieve accurate human mesh recovery even under severe occlusions.
  • Figure 2: Comparison of current methods and the proposed DPMesh. (a) Conventional methods [hmr, kolotouros2019learning] apply a feature extractor $\mathcal{G}$ and a regressor $\mathcal{H}$ to obtain SMPL parameters. (b) Diffusion-based methods [feng2023diffpose, cho2023generative, hmdiff2023distribution] propose an iterative framework that harnesses multiple denoising steps to progressively refine the pose parameters from random noise. (c) Distinct from previous diffusion-based techniques, our DPMesh employs the pre-trained denoising U-Net as the backbone $\mathcal{G}$, executing a one-step inference to furnish informative features for the regressor. This novel framework transfers the potent perception knowledge in generative models onto conventional frameworks.
  • Figure 3: The overall framework of DPMesh. Given the input image ${\bm x}$, pre-detected 2D key-points $J_{\rm 2D}$, and generated heatmap $H_{\rm 2D}$, our framework begins with the extraction of image features $\mathcal{F}^S$ through a single denoising step using the pre-trained diffusion model. This process is guided by the designed spatial conditions. Then, we input $\mathcal{F}^S$ to the regressor to predict SMPL parameters $\Theta$, $\beta$ and $\pi$, ultimately generating the final mesh. To further enhance the estimation robustness against noisy 2D observations, we leverage a noisy key-point reasoning approach. This involves pre-training a teacher model with ground truth heatmaps and then training a student model using noisy heatmaps by aligning the feature maps computed from the locked teacher and the student.
  • Figure 4: Qualitative comparisons on 3DPW dataset [3dpw]. Our DPMesh recovers accurate human meshes under challenging occlusions and demonstrates an adept understanding of 3D body structures and spatial relationships. Notably, our method also excels in generating plausible details for the obscured body parts, e.g., hands and legs, proving the robustness of DPMesh in handling complex scenarios.
  • Figure 5: Qualitative results on 3DOH dataset [3dpw]. Our DPMesh obtains satisfactory estimates for complex poses.
  • ...and 4 more figures
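
The noisy key-point reasoning idea in the Figure 3 caption (a teacher fed ground-truth heatmaps, a student fed noisy heatmaps, and an alignment between their feature maps) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the Gaussian heatmap form, the `sigma` value, the Gaussian key-point jitter, and the mean-squared-error alignment loss are all assumptions, and the flattened heatmap stands in for the U-Net feature maps that the real teacher and student would produce.

```python
import math
import random
from typing import List, Sequence, Tuple

Heatmap = List[List[float]]


def gaussian_heatmap(x: float, y: float, width: int, height: int,
                     sigma: float = 2.0) -> Heatmap:
    """Render one 2D key-point as a Gaussian blob (assumed heatmap form)."""
    return [[math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)]


def jitter_keypoints(keypoints: Sequence[Tuple[float, float]], scale: float,
                     rng: random.Random) -> List[Tuple[float, float]]:
    """Simulate noisy 2D detections by adding Gaussian noise to each key-point."""
    return [(x + rng.gauss(0.0, scale), y + rng.gauss(0.0, scale))
            for x, y in keypoints]


def flatten(heatmap: Heatmap) -> List[float]:
    """Toy 'feature extractor': a real model would return U-Net features."""
    return [v for row in heatmap for v in row]


def feature_alignment_loss(teacher_feat: Sequence[float],
                           student_feat: Sequence[float]) -> float:
    """Mean squared error between teacher and student features (assumed loss)."""
    if len(teacher_feat) != len(student_feat):
        raise ValueError("feature maps must have the same size")
    return sum((t - s) ** 2
               for t, s in zip(teacher_feat, student_feat)) / len(teacher_feat)
```

In a training loop following this pattern, the teacher's features would be computed from the clean heatmap without gradient updates (the "locked" teacher), and only the student would be optimized to reduce the alignment loss alongside the usual mesh regression losses.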