VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery

Wenhao Shen, Hao Wang, Wanqi Yin, Fayao Liu, Xulei Yang, Chao Liang, Zhongang Cai, Guosheng Lin

TL;DR

This work introduces a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes, and proposes a group preference alignment framework for finetuning diffusion-based HMR models.

Abstract

Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent diffusion-based methods tackle this by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Leveraging this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches.
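
As a rough illustration of how per-mesh quality scores could be turned into a group-wise preference record, consider the sketch below. It is not the authors' implementation: the names `GroupPreference` and `build_group_preference`, the softmax weighting, and the `min_margin` threshold are assumptions made for this example only. The abstract states only that critique scores are used to build a group-wise preference dataset.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class GroupPreference:
    """Hypothetical record for one image and its group of mesh hypotheses."""
    image_id: str
    ranking: list   # candidate indices, best score first
    weights: list   # softmax of scores, aligned with candidate index
    pairs: list     # (preferred_idx, rejected_idx) tuples


def softmax(scores, temperature=1.0):
    # Subtract the max before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_group_preference(image_id, scores, min_margin=0.0, temperature=1.0):
    """Turn critique scores for one group into rankings, weights, and pairs.

    Assumption: higher score means a better mesh. Pairs whose score gap
    is not larger than `min_margin` are dropped as too ambiguous.
    """
    if len(scores) < 2:
        raise ValueError("a group needs at least two candidates")
    indices = range(len(scores))
    ranking = sorted(indices, key=lambda i: scores[i], reverse=True)
    weights = softmax(scores, temperature)
    pairs = [
        (i, j) if scores[i] > scores[j] else (j, i)
        for i, j in combinations(indices, 2)
        if abs(scores[i] - scores[j]) > min_margin
    ]
    return GroupPreference(image_id, ranking, weights, pairs)


# Usage: three hypotheses for one image, scored by a (hypothetical) critic.
record = build_group_preference("img_0001", [0.9, 0.2, 0.5], min_margin=0.1)
print(record.ranking)  # [0, 2, 1]
print(record.pairs)    # [(0, 1), (0, 2), (2, 1)]
```

Keeping the whole group together, rather than a single best/worst pair, preserves the relative ordering among all hypotheses for an image, which is the kind of signal a group-level alignment objective could consume.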

Paper Structure

This paper contains 15 sections, 13 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: We introduce a VLM-guided HMR critique agent equipped with a dual-memory mechanism that delivers stable and semantically grounded assessments for groups of estimated 3D meshes. Building on these group-wise signals, our group preference alignment framework steers diffusion-based HMR models towards more coherent and reliable mesh generation.
  • Figure 2: Overview of our framework. Our goal is to refine a diffusion-based HMR model that generates a group of human mesh predictions per input image. We propose a VLM-enhanced HMR critique agent that assigns a score to each human mesh prediction. This critique agent is equipped with a dual-memory mechanism so that its assessments remain stable. We then use the critique agent to build a group-wise HMR preference dataset without manual labeling. Finally, we use this preference dataset to finetune the base model so that it preferentially generates predictions that are physically plausible and better aligned with the image cues.
  • Figure 3: Qualitative comparison between our method and the state-of-the-art probabilistic model ADHMR [shen2025adhmr]. Examples (a) to (e) are from the 3DPW [3dpw] dataset, while (f) to (h) are challenging internet images. Both overlay and side-view results are shown.
  • Figure 4: Visualization of our critique agent on an internet image. We compare the erroneous initial score from HMR-Scorer [shen2025adhmr] with the corrected score from our critique agent, along with the corresponding critique.