PHAC: Promptable Human Amodal Completion

Seung Young Noh, Ju Yong Chang

Abstract

Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model until they obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries. Extensive experiments on the HAC and PGPIS benchmarks show that our approach yields more physically plausible and higher-quality completions, while significantly improving prompt alignment compared with existing amodal completion and pose-guided synthesis methods.

Paper Structure (34 sections, 17 equations, 12 figures, 11 tables)

Figures (12)

  • Figure 1: Assume that, as in (A), a single input image is paired with multiple target poses. The desired behavior is that each generated image aligns with its target pose while preserving the visible appearance. However, prior human amodal completion (HAC) methods such as SDHDO [noh2025sdhdo] often fail to align with the specified pose, as shown in (B), despite leveraging pose information. Pose-guided person image synthesis (PGPIS) methods such as MCLD [liu2025mcld] align well with the pose condition but degrade the visible appearance of the input, especially around the shoes, as shown in (C). While SDHDO preserves the visible appearance better than MCLD, its results remain blurry and show noticeable degradation. In contrast, given the multi-condition inputs in (A), our method simultaneously aligns with the target pose and preserves the visible appearance, producing the user-intended human image.
  • Figure 2: User prompts. Users specify the intended pose with point prompts, which we use to condition the model. For the pose prompt $p_\text{po}$, we use OpenPose [cao2019openpose] to detect the visible joints and show them to the user, who then adds the missing joints for the desired pose. Alternatively, the user selects two points to specify a bbox prompt, choosing either an interest-region bbox $p_\text{ib}$ or an entire-region bbox $p_\text{eb}$. To provide fine-grained control, the pose and bbox prompts can be combined, yielding $p_\text{poib}$ or $p_\text{poeb}$. To make effective use of the spatial information, we convert the point coordinates into a prompt image $I_\text{p}$ and use it as a conditioning input.
  • Figure 3: Method overview. Given an incomplete image $I_\text{ic}$ and a user prompt $P$, our PHAC framework processes them through (A) coarse image generation and (B, C) a refinement stage. In (A), the denoising U-Net $\epsilon_\text{cig}$ starts from random noise and denoises it for $T$ steps to generate a coarse complete image $I_\text{cc}$, conditioned on a prompt image $I_\text{p}$ (see Fig. 2); $I_\text{p}$ is fed to a prompt-specific ControlNet $\Phi_\text{CN}$ to provide conditioning, and only the cross-attention blocks of $\epsilon_\text{cig}$ are fine-tuned to preserve the pre-trained prior. In (B), the invisible mask prediction U-Net $\mathcal{U}_\text{iv}$ predicts an invisible mask $M_\text{iv}$, which is then dilated to $M_\text{iv}^\prime$. In (C), we construct the base composite $I_\text{base}$ and add low-magnitude noise to the invisible region only. The refinement network $\Phi_\text{RF}$ then takes the noisy $I_\text{base}$ as input and outputs the refined completion $I_\text{rc}$, preserving the visible region while refining the coarse completion and mitigating boundary artifacts.
  • Figure 4: Qualitative comparison on the AHP test dataset. (A) Partial RGB results; (B) Generated images with different seeds. PGPIS baselines (PIDM, MCLD) frequently hallucinate training set appearances. Amodal completion baselines (pix2gestalt, SDHDO) do not preserve the visible appearance and often violate the pose condition. In contrast, our approach yields consistent pose alignment across seeds and preserves the visible regions.
  • Figure S6: Construction pipeline of the OccThuman2.0 dataset. For each THuman2.0 mesh, we render 10 views and apply per-view data augmentations and background-occlusion compositing to generate 10 composited images. Repeating this process across all 526 meshes produces a total of 5,260 composited images.
  • ...and 7 more figures
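
The captions for Figures 2 and 3 describe two data-preparation steps that a small sketch may make more concrete: turning point prompts into a prompt image $I_\text{p}$, and building a noisy base composite from $I_\text{ic}$, $I_\text{cc}$, and the dilated mask $M_\text{iv}^\prime$. The NumPy code below is only an illustration under assumptions that are not stated in the source: the function names, the channel layout of the prompt image, the 3x3 dilation kernel, the compositing rule (visible pixels from $I_\text{ic}$, masked pixels from $I_\text{cc}$), the use of the dilated mask for the noise, and pixel-space Gaussian noise with standard deviation `sigma` are all hypothetical choices. The actual method may operate in a latent space and encode prompts differently.

```python
import numpy as np


def rasterize_prompt(points, bbox, height, width, radius=2):
    """Hypothetical prompt image: joints in channel 0, bbox outline in channel 1."""
    img = np.zeros((height, width, 3), dtype=np.float32)
    for x, y in points:
        # Draw each joint as a small square, clipped to the image bounds.
        y0, y1 = max(y - radius, 0), min(y + radius + 1, height)
        x0, x1 = max(x - radius, 0), min(x + radius + 1, width)
        img[y0:y1, x0:x1, 0] = 1.0
    if bbox is not None:
        # The bbox is given by two corner points, as in the two-click prompt.
        (xa, ya), (xb, yb) = bbox
        x0, x1 = sorted((xa, xb))
        y0, y1 = sorted((ya, yb))
        img[y0, x0:x1 + 1, 1] = 1.0
        img[y1, x0:x1 + 1, 1] = 1.0
        img[y0:y1 + 1, x0, 1] = 1.0
        img[y0:y1 + 1, x1, 1] = 1.0
    return img


def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square kernel (an assumed kernel choice)."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out


def noisy_base(incomplete, coarse, invisible_mask, sigma=0.1, dilate_iters=2, seed=0):
    """Composite visible pixels with the coarse completion, then add noise
    inside the dilated invisible mask only (assumed compositing rule)."""
    m_bool = dilate(invisible_mask, dilate_iters)
    m = m_bool[..., None].astype(np.float32)
    base = (1.0 - m) * incomplete + m * coarse
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=base.shape).astype(np.float32)
    return base + m * noise, m_bool
```

In this sketch the visible region passes through untouched, because both the coarse completion and the noise are multiplied by the mask. Dilating the mask before compositing gives the refinement step a thin band around the occlusion boundary to work with, which is one plausible reading of how boundary artifacts would be mitigated.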