DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery

Jing Gao, Ce Zheng, Laszlo A. Jeni, Zackory Erickson

TL;DR

This work tackles in-bed human mesh recovery under privacy-driven data scarcity by introducing DiSRT-In-Bed, a diffusion-based sim-to-real transfer framework that leverages extensive synthetic depth data and limited real-world samples. A diffusion U-Net denoises SMPL latent parameters conditioned on depth images, bridging the synthetic-real domain gap and delivering robust meshes across varying coverings and environments. The method combines physics-based synthetic data generation with a two-stage training strategy (synthetic pretraining and linearly scheduled real-data fine-tuning), achieving state-of-the-art performance on MPJPE and PVE metrics while showing strong generalization to hospital settings. The approach promises practical clinical impact by enabling accurate, privacy-preserving, and scalable in-bed mesh estimation in healthcare contexts.
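
The two metrics named above have standard definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, and PVE is the same quantity computed over mesh vertices. The following is a minimal sketch of those definitions, not the authors' evaluation code. The function names are illustrative, the inputs are assumed to share one unit (commonly millimetres), and no alignment step is applied because this summary does not say whether the paper uses one.

```python
import numpy as np


def mean_point_error(pred, gt):
    """Average Euclidean distance between matching 3D points.

    pred, gt: arrays of shape (..., num_points, 3) in the same unit.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Norm over the xyz axis, then mean over points and any batch axes.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error (hypothetical helper name)."""
    return mean_point_error(pred_joints, gt_joints)


def pve(pred_vertices, gt_vertices):
    """Per-vertex error (hypothetical helper name)."""
    return mean_point_error(pred_vertices, gt_vertices)


# Toy check: every joint is displaced by (3, 4, 0), so each distance is 5.
gt_joints = np.zeros((2, 24, 3))        # SMPL defines 24 joints
pred_joints = gt_joints + np.array([3.0, 4.0, 0.0])
joint_error = mpjpe(pred_joints, gt_joints)

# Identical meshes give zero error.
gt_vertices = np.ones((2, 6890, 3))     # the SMPL mesh has 6890 vertices
vertex_error = pve(gt_vertices, gt_vertices)
```

Lower is better for both metrics. PVE is sensitive to body-shape errors that MPJPE can miss, since joints alone do not capture the body surface.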

Abstract

In-bed human mesh recovery can be a crucial enabler for several healthcare applications, including sleep pattern monitoring, rehabilitation support, and pressure ulcer prevention. However, it is difficult to collect large real-world visual datasets in this domain, in part due to privacy and expense constraints, which in turn presents significant challenges for training and deploying deep learning models. Existing in-bed human mesh estimation methods often rely heavily on real-world data, limiting their ability to generalize across different in-bed scenarios, such as varying coverings and environmental settings. To address this, we propose a Sim-to-Real Transfer Framework for in-bed human mesh recovery from overhead depth images, which leverages large-scale synthetic data alongside limited or no real-world samples. We introduce a diffusion model that bridges the gap between synthetic data and real data to support generalization in real-world in-bed pose and body inference scenarios. Extensive experiments and ablation studies validate the effectiveness of our framework, demonstrating significant improvements in robustness and adaptability across diverse healthcare scenarios.

Paper Structure

This paper contains 33 sections, 15 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Impact of real-world data scarcity on in-bed human mesh recovery. BodyMAP shows significant performance degradation when trained with limited real-world data, while our method maintains robust accuracy. 'Sim' indicates training with all synthetic data and '$n\%\text{Real}$' indicates training with $n\%$ of the real data from the training dataset.
  • Figure 2: Overview of the Proposed Sim-to-Real Transfer Framework. The framework comprises three stages: In the Synthetic Data Generation stage (left), a large, diverse set of synthetic depth images is generated within a simulated environment. In the training stage, the diffusion model $\mathcal{D}$ conditions on the synthetic depth image $\mathbf{c}_\text{syn}$ to denoise SMPL parameters $\mathbf{x}_t$ in the reverse process, which begins at timestep $T$ and progresses toward timestep 0, yielding the estimated human mesh $\hat{\mathcal{M}}_\text{syn}$. In the fine-tuning stage, the model conditions on real depth images $\mathbf{c}_\text{real}$ to estimate the human mesh $\hat{\mathcal{M}}_\text{real}$. The symbol '$g$' in the diffusion model indicates the gender flag associated with the input. The 'Ref.' in the figure denotes the corresponding synthetic depth image during training and the corresponding RGB image for visualization purposes only.
  • Figure 3: Diffusion Model Architecture. Dashed lines around specific layers indicate optional layers that may be omitted in certain blocks of the model implementation.
  • Figure 4: Visualization of Human Meshes Estimated from Limited Real-World Data in Home Settings. The left three columns show the input depth images, RGB reference images, and ground-truth meshes, respectively. 'Sim' denotes using all simulation data in the training stage; 'n%Real' denotes that n% of the real data is used for fine-tuning in our method and for joint training in the baseline. 'Uncover' means there is no blanket on the bed, 'Cover 1' means the participant is covered with a thin blanket, and 'Cover 2' means the participant is covered with a thick blanket. The red arrows mark mismatches between the predicted mesh and the reference images.
  • Figure 5: Ablation Study on Diffusion-Based Sim-to-Real Transfer Framework.
  • ...and 8 more figures
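
The reverse process described in the Figure 2 caption (start from noise at timestep $T$, denoise the SMPL parameters step by step toward timestep 0, conditioned on a depth image and a gender flag) can be illustrated with a generic DDPM-style sampling loop. This is a sketch under stated assumptions, not the paper's implementation: the noise-prediction parameterization, the linear beta schedule, the 82-dimensional parameter vector (72 pose + 10 shape values), the placeholder depth-image size, and all function names are assumptions made for illustration. The `denoiser` argument stands in for the diffusion U-Net; here it is replaced by an oracle that already knows the target, purely so the loop can be run end to end.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def reverse_process(denoiser, cond, gender, dim, schedule, rng):
    """DDPM ancestral sampling: x_T ~ N(0, I), then step down to x_0.

    denoiser(x_t, t, cond, gender) must return the predicted noise.
    Steps are indexed T-1 ... 0 internally.
    """
    betas, alphas, alpha_bars = schedule
    x = rng.standard_normal(dim)  # pure noise at the final timestep
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = denoiser(x, t, cond, gender)
        # Posterior mean under the noise-prediction parameterization.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the last step.
        noise = rng.standard_normal(dim) if t > 0 else np.zeros(dim)
        x = mean + np.sqrt(betas[t]) * noise
    return x


def make_oracle_denoiser(target, schedule):
    """Stand-in for the trained network: returns the exact noise that
    separates x_t from a known target. It ignores cond and gender."""
    _, _, alpha_bars = schedule

    def denoiser(x_t, t, cond, gender):
        return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

    return denoiser


schedule = make_schedule()
rng = np.random.default_rng(0)
param_dim = 82                        # assumed: 72 pose + 10 shape values
target_params = rng.standard_normal(param_dim)
depth_image = np.zeros((64, 32))      # placeholder conditioning input
gender_flag = 0                       # placeholder value for the flag g

estimated_params = reverse_process(
    make_oracle_denoiser(target_params, schedule),
    depth_image, gender_flag, param_dim, schedule, rng,
)
```

With the oracle denoiser the loop recovers `target_params` up to floating-point error, which is a convenient sanity check on the update rule. In the framework described above, a trained network would take the oracle's place and the resulting parameters would be passed to the SMPL model to produce the mesh.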