HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance

Jue Gong, Tingyu Yang, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang

TL;DR

A degradation pipeline simulates the coexistence of human motion blur (HMB) and generic noise, generating synthetic degraded data to train the proposed HAODiff, a human-aware one-step diffusion model that surpasses existing state-of-the-art (SOTA) methods in both quantitative metrics and visual quality on synthetic and real-world datasets.

Abstract

Human-centered images often suffer from severe generic degradation during transmission and are prone to human motion blur (HMB), making restoration challenging. Existing research pays insufficient attention to this combination, even though the two problems often coexist in practice. To address this, we design a degradation pipeline that simulates the coexistence of HMB and generic noise, generating synthetic degraded data to train our proposed HAODiff, a human-aware one-step diffusion model. Specifically, we propose a triple-branch dual-prompt guidance (DPG), which leverages high-quality images, residual noise (LQ minus HQ), and HMB segmentation masks as training targets. It produces a positive-negative prompt pair for classifier-free guidance (CFG) in a single diffusion step. The resulting adaptive dual prompts let HAODiff exploit CFG more effectively, boosting robustness against diverse degradations. For fair evaluation, we introduce MPII-Test, a benchmark rich in combined noise and HMB cases. Extensive experiments show that HAODiff surpasses existing state-of-the-art (SOTA) methods in both quantitative metrics and visual quality on synthetic and real-world datasets, including our introduced MPII-Test. Code is available at: https://github.com/gobunu/HAODiff.
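
To make the role of the positive-negative prompt pair concrete, here is a minimal sketch of the standard classifier-free guidance combination, which extrapolates from the negative-prompt output toward the positive-prompt output. This is the textbook CFG rule, not necessarily the exact formulation used in HAODiff; the function name `cfg_combine`, the default `guidance_scale`, and the random arrays standing in for the two UNet outputs are all illustrative assumptions.

```python
import numpy as np

def cfg_combine(z_pos, z_neg, guidance_scale=1.5):
    """Textbook CFG rule: z_neg + s * (z_pos - z_neg).

    z_pos / z_neg stand in for the model outputs conditioned on the
    positive and negative prompts. The scale value is an assumption.
    """
    z_pos = np.asarray(z_pos, dtype=np.float64)
    z_neg = np.asarray(z_neg, dtype=np.float64)
    return z_neg + guidance_scale * (z_pos - z_neg)

# Hypothetical stand-ins for the two conditioned outputs.
rng = np.random.default_rng(0)
z_pos = rng.standard_normal((4, 8, 8))
z_neg = rng.standard_normal((4, 8, 8))

z_guided = cfg_combine(z_pos, z_neg, guidance_scale=1.5)
```

With a scale of 1 the result reduces to the positive branch alone, and with a scale of 0 to the negative branch; values above 1 push the output further away from what the negative prompt describes.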

Paper Structure

This paper contains 13 sections, 11 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Performance comparison on the introduced MPII-Test. HMB-R denotes the ratio of human motion blur detection instances before and after restoration. Lower-is-better metrics are inverted.
  • Figure 2: Degradation pipeline overview. The first order contains three possible cases: (i) no degradation, (ii) human motion blur (HMB), and (iii) generic degradation. The HMB branch conducts body-part segmentation to obtain masks, then morphs them to yield the spatial weight map $W_s$. This map is applied to the motion blur image $I_B$, which is generated by convolving the clean image $I_H$ with a point spread function (PSF) derived from a random trajectory. The result is combined with $I_H$ to create the synthetic HMB image $I_{\text{HMB}}$. The second order applies conventional generic degradation.
  • Figure 3: Model structure of our HAODiff. Stage 1: We train a triple-branch dual-prompt guidance (DPG). The core structure consists of a downsampler and upsamplers ($H_D$, $H_{Ui}$), as well as feature extraction and reconstruction modules ($H_E$ and $H_{Ri}$). Both $H_E$ and $H_{Ri}$ are composed of two residual Swin Transformer blocks (RSTB). The three branches are individually trained with the human motion blur segmentation masks ($M_\text{HMB}$), residual noise ($I_L - I_H$), and high-quality images ($I_H$). Stage 2: We combine DPG with a prompt embedder to provide positive and negative prompt pairs to the one-step diffusion (OSD) model. The UNet generates $z_\text{pos}$ and $z_\text{neg}$, which are used to obtain the predicted latent vector $\hat{z}_H$ through classifier-free guidance (CFG) and denoising operations.
  • Figure 4: Structure of the prompt embedder. The Attention Pooling uses a learnable embedding as $Q$, while $K$ and $V$ come from the output of the Performer Encoder, whose depth $N$ is set to 6.
  • Figure 5: Visual comparison on the synthetic PERSONA-Val. Please zoom in for a better view.
  • ...and 3 more figures
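
The HMB branch described in the Figure 2 caption can be sketched in a few lines. The version below assumes the combination step is a per-pixel blend, $I_{\text{HMB}} = W_s \odot I_B + (1 - W_s) \odot I_H$, which is one plausible reading of the caption rather than a confirmed formula. The random-walk PSF, the helper names (`random_trajectory_psf`, `convolve2d_same`, `synthesize_hmb`), the kernel size, and the box-shaped weight map are all hypothetical choices for illustration, and the example works on a single-channel image only.

```python
import numpy as np

def random_trajectory_psf(size=9, steps=32, seed=0):
    """Rasterize a random walk into a normalized blur kernel.

    This is a simple stand-in for a trajectory-derived PSF, not the
    paper's trajectory model.
    """
    rng = np.random.default_rng(seed)
    psf = np.zeros((size, size), dtype=np.float64)
    pos = np.array([size // 2, size // 2], dtype=np.float64)
    for _ in range(steps):
        r, c = np.clip(np.rint(pos).astype(int), 0, size - 1)
        psf[r, c] += 1.0
        # Take a small random step and stay inside the kernel support.
        pos = np.clip(pos + rng.normal(scale=0.7, size=2), 0, size - 1)
    return psf / psf.sum()

def convolve2d_same(image, kernel):
    """Same-size 2D convolution of an (H, W) image with an odd-sized kernel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="edge")
    flipped = kernel[::-1, ::-1]  # flip so this is convolution, not correlation
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + h, j:j + w]
    return out

def synthesize_hmb(i_h, w_s, psf):
    """Blend the blurred image I_B into the clean image I_H using W_s (assumed blend)."""
    i_b = convolve2d_same(i_h, psf)
    return w_s * i_b + (1.0 - w_s) * i_h

# Toy inputs: a random clean image and a box-shaped weight map
# standing in for the morphed body-part mask.
rng = np.random.default_rng(1)
i_h = rng.random((32, 32))
w_s = np.zeros((32, 32))
w_s[8:24, 8:24] = 1.0
psf = random_trajectory_psf()
i_hmb = synthesize_hmb(i_h, w_s, psf)
```

Pixels where the weight map is zero are left identical to the clean image, so the blur stays confined to the masked region. A soft-edged weight map would make the transition between blurred and sharp regions gradual.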