Human Body Restoration with One-Step Diffusion Model and A New Benchmark

Jue Gong, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang

TL;DR

This work tackles the absence of benchmarks for human body restoration by introducing HQ-ACF, a pipeline that curates the PERSONA dataset of 109,052 HQ 512×512 human images across diverse natural activities. It further introduces OSDHuman, a one-step diffusion model guided by a high-fidelity image embedder (HFIE) and optimized with variational score distillation (VSD) to align outputs with natural image distributions. Empirical results show that OSDHuman achieves superior visual quality and quantitative metrics on both synthetic and real-world PERSONA data, outperforming several baseline diffusion methods while reducing inference costs. The combination of a robust dataset and a specialized one-step model provides a practical, scalable solution for high-quality human body restoration with broad applicability in imaging and related tasks.

Abstract

Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed a person-based restoration with sophisticated objects and natural activities (PERSONA) dataset, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose OSDHuman, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will be available at https://github.com/gobunu/OSDHuman.

Paper Structure

This paper contains 13 sections, 7 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Comparison of no-reference image quality assessment metrics across human-related datasets. The object detection datasets specifically evaluate subsets with humans. Our proposed PERSONA dataset outperforms others significantly.
  • Figure 2: Visual examples of diffusion-based image restoration methods evaluated on PERSONA-test. The asterisk (*) indicates methods retrained on PERSONA dataset. Our OSDHuman produces more natural and faithful visual results compared to others.
  • Figure 3: High-quality dataset automated cropping and filtering pipeline. The pipeline consists of four stages. First, multiple datasets are collected, comprising millions of images. Images without labels are processed using YOLO11 for human detection. Then, a Laplacian operator is applied to compute each image's Laplacian variance, and images below a threshold are filtered out. Next, human boxes are adjusted to a square shape, and overly small or densely packed boxes are removed. Finally, cropped human images are evaluated using Image Quality Assessment (IQA) metrics. Images ranking in the top third by normalized metrics and exceeding the metric threshold are selected. These 109,052 images constitute the person-based restoration with sophisticated objects and natural activities (PERSONA) dataset.
  • Figure 4: Training Framework of OSDHuman. First, the LQ image $I_L$ is processed through the VAE Encoder, U-Net, and VAE Decoder, ultimately producing the restored HQ image $\hat{I}_H$. The conditional input of the U-Net is provided by the high-fidelity image embedder (HFIE). Second, during training, the latent $\hat{z}_H$ generated by the U-Net is perturbed with noise and then passed through the pretrained and finetuned regularizers. $\mathcal{L}_{\text{VSD}}$ represents the difference between the distribution of the model output and that of natural images. $\mathcal{L}_{\text{VSD}}$, together with $\mathcal{L}_{\text{LPIPS}}$ and $\mathcal{L}_{\text{MSE}}$, constitutes the training objective. In summary, during the training stage, the VAE Encoder, U-Net, and finetuned regularizer are trained with LoRA, while other modules remain frozen. During inference, the VSD module is not utilized.
  • Figure 5: Comparison of the architectures of HFIE and DAPE.
  • ...and 4 more figures
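
The filtering stages described in the Figure 3 caption (Laplacian-variance filtering, squaring the human box, and top-third selection on normalized IQA scores) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 4-neighbour Laplacian kernel, the min-max normalization, the averaging of metrics, and every threshold are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of a 4-neighbour discrete Laplacian over interior pixels.

    Low values suggest a blurry image. The kernel choice is an assumption;
    the caption only says "a Laplacian operator".
    """
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def square_box(x1, y1, x2, y2, img_w, img_h):
    """Grow an integer box to a square around its centre.

    The square is shifted to stay inside the image. Returns None when the
    square cannot fit (hypothetical policy, not stated in the source).
    """
    side = max(x2 - x1, y2 - y1)
    if side > min(img_w, img_h):
        return None
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    left = min(max(int(round(cx - side / 2)), 0), img_w - side)
    top = min(max(int(round(cy - side / 2)), 0), img_h - side)
    return left, top, left + side, top + side


def select_top_third(scores: np.ndarray, min_score: float) -> np.ndarray:
    """Indices of images in the top third that also pass a score threshold.

    scores has shape (n_images, n_metrics), higher is better. Each metric is
    min-max normalized and the metrics are averaged (both are assumptions).
    """
    s = scores.astype(np.float64)
    span = s.max(axis=0) - s.min(axis=0)
    norm = (s - s.min(axis=0)) / np.where(span > 0, span, 1.0)
    combined = norm.mean(axis=1)
    k = len(combined) // 3
    top = np.argsort(-combined, kind="stable")[:k]
    return np.sort(top[combined[top] >= min_score])


# Toy inputs: a flat image, a checkerboard, and six single-metric scores.
flat = np.full((8, 8), 7)
rows, cols = np.indices((8, 8))
checker = ((rows + cols) % 2) * 255
toy_scores = np.array([[0.1], [0.9], [0.5], [0.8], [0.2], [0.3]])
```

In this sketch a crop would be kept only if `laplacian_variance` exceeds a chosen blur threshold, `square_box` returns a box rather than `None`, and its index appears in the output of `select_top_third`. The removal of overly small or densely packed boxes is omitted because the caption does not give the criteria.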