MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing

Jinghan Yu, Junhao Xiao, Zhiyuan Ma, Yue Ma, Kaiqi Liu, Yuhan Wang, Daizong Liu, Xianghao Meng, Jianjun Li

TL;DR

This work tackles the challenge of precise, multi-instance human erasing in complex scenes where occlusion, entanglement, and background interference impede faithful restoration. It introduces MILD, a Multi-Layer Diffusion framework that disentangles each foreground instance from the background by producing per-instance foreground layers and a background layer using a shared UNet backbone with domain-specific LoRA adapters. The authors formalize Cross-Domain Attention Gap (CAG) and augment the architecture with Human Morphology Guidance (HMG) and Spatially-Modulated Attention (SMA) to maximize attention separation and suppress semantic leakage, enabling instance-aware generation and flexible scene recomposition. A high-quality MILD dataset is released for training and evaluation, and experiments demonstrate state-of-the-art performance on challenging human-erasing tasks across perceptual, semantic, and structural metrics, with strong generalization to open-domain scenes and detailed ablations supporting the design choices.
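The idea of a shared backbone with domain-specific LoRA adapters can be illustrated on a single linear layer. The following is a minimal sketch, not the authors' implementation: the class name `LayeredLoRALinear`, the domain keys `"fg"` and `"bg"`, the rank-1 factors, and all numeric values are hypothetical. It only shows that each branch reuses one shared weight and adds its own low-rank update.

```python
# Illustrative sketch only: one shared weight matrix plus per-domain
# low-rank (LoRA-style) updates. All names and values are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


class LayeredLoRALinear:
    """Linear layer whose effective weight is W_shared + scaling * B_d @ A_d."""

    def __init__(self, shared_weight, adapters, scaling=1.0):
        self.shared_weight = shared_weight  # shape: d_out x d_in
        self.adapters = adapters            # domain -> (B: d_out x r, A: r x d_in)
        self.scaling = scaling

    def effective_weight(self, domain):
        # Low-rank update for the requested domain, added to the shared weight.
        b, a = self.adapters[domain]
        delta = matmul(b, a)
        return [[w + self.scaling * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.shared_weight, delta)]

    def forward(self, x, domain):
        # x is a vector of length d_in; returns a vector of length d_out.
        weight = self.effective_weight(domain)
        return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Shared 2x2 identity weight with two hypothetical rank-1 adapters.
shared = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "fg": ([[1.0], [0.0]], [[0.0, 0.5]]),   # foreground branch
    "bg": ([[0.0], [1.0]], [[0.25, 0.0]]),  # background branch
}
layer = LayeredLoRALinear(shared, adapters)

x = [2.0, 4.0]
fg_out = layer.forward(x, "fg")  # uses W_shared + B_fg @ A_fg
bg_out = layer.forward(x, "bg")  # uses W_shared + B_bg @ A_bg
print(fg_out, bg_out)
```

The same input yields different outputs per branch while the shared weight is stored once, which is the property that makes running several denoising branches through one backbone inexpensive.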

Abstract

Recent years have witnessed the success of diffusion models in image customization tasks. However, existing mask-guided human erasing methods still struggle in complex scenarios such as human-human occlusion, human-object entanglement, and human-background interference, mainly due to the lack of large-scale multi-instance datasets and effective spatial decoupling to separate foreground from background. To bridge these gaps, we curate the MILD dataset capturing diverse poses, occlusions, and complex multi-instance interactions. We then define the Cross-Domain Attention Gap (CAG), an attention-gap metric to quantify semantic leakage. On top of these, we propose Multi-Layer Diffusion (MILD), which decomposes the generation process into independent denoising pathways, enabling separate reconstruction of each foreground instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, a plug-and-play module that incorporates pose, parsing, and spatial relationships into the diffusion process to improve structural awareness and restoration quality. Additionally, we present Spatially-Modulated Attention, an adaptive mechanism that leverages spatial mask priors to modulate attention across semantic regions, further widening the CAG to effectively minimize boundary artifacts and mitigate semantic leakage. Experiments show that MILD significantly outperforms existing methods. Datasets and code are publicly available at: https://mild-multi-layer-diffusion.github.io/.

Paper Structure

This paper contains 53 sections, 6 theorems, 66 equations, 13 figures, 6 tables, 3 algorithms.

Key Result

Theorem 1

Let $H$ be the Hessian of $\mathcal{L}_{\text{total}}$ with respect to $(\theta_{\mathrm{fg}},\theta_{\mathrm{bg}})$. Then [...] and the following bounds hold. If all cross-domain attention logits are set to $-\infty$, then [...]. Otherwise, define the background-favoring margin $\gamma$ as in Eq. (eq:gamma_definition_no_sma). Assuming $\gamma\ge 0$, there exists $K>0$ such that [...]
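The hard-masking case mentioned in the theorem has a simple effect on a single attention row, which can be written out directly. The sketch below is not part of the theorem statement. It assumes a row-wise softmax over logits $A_{ij}$ and a binary domain label $m_i$ per token (notation borrowed from the Figure 4 caption), with $P_{ij}$ and $\tilde{P}_{ij}$ introduced here for the attention weights before and after masking.

```latex
P_{ij} = \frac{\exp(A_{ij})}{\sum_{k} \exp(A_{ik})},
\qquad
\tilde{P}_{ij} =
\begin{cases}
\dfrac{\exp(A_{ij})}{\sum_{k:\, m_k = m_i} \exp(A_{ik})}, & m_j = m_i,\\[2ex]
0, & m_j \neq m_i .
\end{cases}
```

Since $\exp(-\infty) = 0$, every cross-domain term drops out of both numerator and denominator. Each query then attends only to keys in its own domain, so the attention matrix is block-diagonal once tokens are sorted by domain.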

Figures (13)

  • Figure 1: MILD (our method) robustly handles three critical challenges in human erasing: Human–Human Occlusion, Human–Object Entanglement, and Human–Background Interference, achieving clean, artifact-free removal across diverse scenarios.
  • Figure 2: Peak signal of semantic leakage.
  • Figure 3: Overview of MILD. Given an input image and a set of target masks, the proposed Multi-Layer Diffusion (MILD) strategy performs human erasing by generating disentangled foreground layers and a background layer. A shared UNet with Layered LoRA enables efficient denoising across all branches, producing composable outputs. Spatially-Modulated Attention (SMA) injects adaptive biases to suppress semantic leakage across masked regions. Cross-attention is conditioned on the text prompt and Human Morphology Guidance (HMG) to guide instance-aware generation.
  • Figure 4: (a) HMG extracts fine-grained human priors from pose and parsing maps using two dedicated encoders. These features, combined with a spatial mask prior, form a unified latent representation that enriches the denoising process with human-centric and spatial cues, improving fidelity and identity consistency. (b) Illustration of the proposed Spatially-Modulated Attention (SMA) mechanism. For each query-key pair $(i, j)$, a spatial bias $\alpha_{st}$ is applied to the vanilla attention score $A_{ij}$, where $s = m_i$ and $t = m_j$ indicate their foreground/background status.
  • Figure 5: Qualitative results produced by MILD (ours) and other methods in real-world scenes. The masked regions (in red) and the corresponding removal results (in green) are highlighted.
  • ...and 8 more figures
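The Figure 4(b) caption describes a bias $\alpha_{st}$ applied to the attention score $A_{ij}$ according to the foreground/background status of the query and key. A minimal sketch of that mechanism follows, under several assumptions that do not come from the source: the bias is added to the logits before a row-wise softmax, the bias table is fixed rather than adaptive, and `cross_domain_mass` is a hypothetical diagnostic for leakage, not the paper's CAG definition.

```python
import math

# Illustrative sketch only. Assumes the bias alpha[s][t] is ADDED to the
# logit A_ij before a row-wise softmax; the values below are hypothetical.


def softmax(row):
    """Row softmax; -inf logits receive exactly zero weight."""
    peak = max(row)  # assumes at least one finite logit per row
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatially_modulated_attention(scores, mask, alpha):
    """scores: N x N logits A_ij; mask[i] in {0 (bg), 1 (fg)};
    alpha[s][t]: bias for a query with status s and a key with status t."""
    weights = []
    for i, row in enumerate(scores):
        biased = [a + alpha[mask[i]][mask[j]] for j, a in enumerate(row)]
        weights.append(softmax(biased))
    return weights


def cross_domain_mass(weights, mask):
    """Hypothetical leakage diagnostic: mean attention mass that a query
    places on keys whose status differs from its own."""
    leaked = [sum(w for j, w in enumerate(row) if mask[j] != mask[i])
              for i, row in enumerate(weights)]
    return sum(leaked) / len(leaked)


# Four tokens: two foreground (1) and two background (0), uniform logits.
mask = [1, 1, 0, 0]
scores = [[0.0] * 4 for _ in range(4)]

no_bias = [[0.0, 0.0], [0.0, 0.0]]
soft_bias = [[0.0, -2.0], [-2.0, 0.0]]            # penalize cross-domain pairs
hard_bias = [[0.0, -math.inf], [-math.inf, 0.0]]  # forbid cross-domain pairs

vanilla_leak = cross_domain_mass(
    spatially_modulated_attention(scores, mask, no_bias), mask)
soft_leak = cross_domain_mass(
    spatially_modulated_attention(scores, mask, soft_bias), mask)
hard_leak = cross_domain_mass(
    spatially_modulated_attention(scores, mask, hard_bias), mask)
print(vanilla_leak, soft_leak, hard_leak)
```

With uniform logits, half of the attention mass crosses domains when no bias is applied. A negative cross-domain bias shrinks that share, and a bias of $-\infty$ removes it entirely, which corresponds to the hard-masking case in Theorem 1.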

Theorems & Definitions (8)

  • Theorem 1: see proof in Appendix \ref{app:thm_multilayer}
  • Corollary 1: see Appendix \ref{app:cor_kappa} for notation and proof
  • Corollary 2: see proof in Appendix \ref{app:cor_er}
  • Theorem 2: see proof in Appendix \ref{app:thm_sma}
  • Lemma 1
  • proof
  • Lemma 2: Restricted row-softmax Jacobian
  • proof