DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior

Yiming Zhang, Zhe Wang, Xinjie Li, Yunchen Yuan, Chengsong Zhang, Xiao Sun, Zhihang Zhong, Jian Wang

TL;DR

DiffBody addresses artifacts in human body restoration that arise when applying general restoration models to portraits and body images. It introduces a body-aware diffusion framework that integrates pose and attention priors, text guidance via GPT-4V, and a body-part aware diffusion sampler, trained on a new 140k-image dataset assembled from SHHQ, DeepFashion, and Web-human sources. The method achieves superior performance on quantitative metrics (e.g., SSIM, LPIPS, MANIQA, CLIPIQA) and qualitative assessments, including a user study, outperforming state-of-the-art baselines. This work enhances practical human body restoration with structured multimodal conditioning and opens avenues for more nuanced control and identity-preserving restorations.

Abstract

Human body restoration plays a vital role in various applications related to the human body. Despite recent advances in general image restoration using generative models, their performance in human body restoration remains mediocre, often resulting in foreground and background blending, over-smoothed surface textures, missing accessories, and distorted limbs. To address these challenges, we propose a novel approach that constructs a human body-aware diffusion model leveraging domain-specific knowledge to enhance performance. Specifically, we employ a pretrained body attention module to guide the diffusion model's focus on the foreground, addressing issues caused by blending between the subject and background. We also demonstrate the value of revisiting the language modality of the diffusion model in restoration tasks by incorporating text prompts to improve the quality of surface textures and the details of clothing and accessories. Additionally, we introduce a diffusion sampler tailored for fine-grained human body parts, utilizing local semantic information to rectify limb distortions. Lastly, we collect a comprehensive dataset for benchmarking and advancing the field of human body restoration. Extensive experimental validation shows that our approach outperforms existing methods both quantitatively and qualitatively.

Paper Structure (19 sections, 6 equations, 17 figures, 1 table, 1 algorithm)

Figures (17)

  • Figure 1: Comparison between our model and the baseline (Left: Baseline, Right: Ours, Top left corner: LQ input). Compared with the baseline, our model performs better on the problems labeled below each image.
  • Figure 1: Detailed prompt we provide to GPT-4V to caption our dataset.
  • Figure 2: The structure of DiffBody. First, we train the SwinIR model using our proposed dataset and process the low-quality image $I_{LQ}$ with the trained model to obtain a preliminary restored image $I_{reg}$. In addition, a pose map $I_{pose}$ and an attention map $I_{attn}$ are extracted from $I_{reg}$ using existing methods. Afterwards, $I_{reg}$ and $I_{pose}$ are passed into the pre-trained VAE Encoder, then concatenated with $I_{attn}$ and fed to the trainable copy of the SD Encoder. We also utilize textual information (Sec 3.3) and a novel human-centric sampling (Sec 3.4) to enhance the restoration capability. Please see the corresponding sections for details.
  • Figure 2: Visual comparison of DiffBody and other general SOTA methods. Compared to other methods, our model is more effective in generating detailed limbs.
  • Figure 3: During training, texts in black are fed to the model. Texts in green reflect the generative logic of GPT-4V in captioning images.
  • ...and 12 more figures
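
The conditioning path described in the caption on the structure of DiffBody (encode $I_{reg}$ and $I_{pose}$, concatenate with $I_{attn}$, feed the result to the trainable encoder copy) can be illustrated with a shape-only sketch. Everything below is an assumption made for illustration and is not the authors' code: `stub_vae_encode`, `resize_attention`, and `build_condition` are hypothetical names, the encoder is a placeholder that only mimics the 4-channel, 8x-downsampled latent of a Stable Diffusion style VAE, and the attention map is assumed to be single-channel and average-pooled to latent resolution, which the caption does not specify.

```python
import numpy as np

LATENT_CHANNELS = 4  # assumption: SD-style VAE latent channels
DOWNSAMPLE = 8       # assumption: SD-style VAE spatial downsampling factor


def _pool(x: np.ndarray) -> np.ndarray:
    """Average-pool a (C, H, W) array by DOWNSAMPLE along H and W."""
    c, h, w = x.shape
    blocks = x.reshape(c, h // DOWNSAMPLE, DOWNSAMPLE, w // DOWNSAMPLE, DOWNSAMPLE)
    return blocks.mean(axis=(2, 4))


def stub_vae_encode(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the pre-trained VAE encoder.

    Maps a (3, H, W) image to a (4, H/8, W/8) array. It reproduces
    the latent shape only; the values are not a real latent.
    """
    pooled = _pool(image)                       # (3, H/8, W/8)
    extra = pooled.mean(axis=0, keepdims=True)  # placeholder fourth channel
    latent = np.concatenate([pooled, extra], axis=0)
    assert latent.shape[0] == LATENT_CHANNELS
    return latent


def resize_attention(attn: np.ndarray) -> np.ndarray:
    """Bring a (1, H, W) attention map to latent resolution (assumed step)."""
    return _pool(attn)


def build_condition(i_reg: np.ndarray, i_pose: np.ndarray, i_attn: np.ndarray) -> np.ndarray:
    """Concatenate encoded I_reg, encoded I_pose, and I_attn along channels."""
    z_reg = stub_vae_encode(i_reg)
    z_pose = stub_vae_encode(i_pose)
    a = resize_attention(i_attn)
    return np.concatenate([z_reg, z_pose, a], axis=0)


# Dummy inputs standing in for the SwinIR output, pose map, and attention map.
H = W = 512
cond = build_condition(
    np.zeros((3, H, W)),  # I_reg
    np.zeros((3, H, W)),  # I_pose
    np.ones((1, H, W)),   # I_attn
)
print(cond.shape)  # (9, 64, 64)
```

Under these assumptions the conditioning input has 4 + 4 + 1 = 9 channels at 1/8 of the image resolution; the actual channel layout and the way the attention map is resized in DiffBody may differ.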