PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation

Xinting Hu, Haoran Wang, Jan Eric Lenssen, Bernt Schiele

TL;DR

PersonaHOI targets realistic human-object interaction (HOI) generation with a personalized face, without requiring additional training. It couples a personalized face diffusion model with a StableDiffusion (SD) branch and uses Cross-Attention Constraint, Latent Merge, and Residual Merge to preserve facial identity while embedding HOI layouts from the SD pathway. The framework achieves superior interaction realism and identity fidelity across HOI-focused and general personalized face generation (PFG) tasks, validated by a novel interaction-alignment metric and comprehensive ablations. Its training-free design and compatibility with ControlNet further enhance practicality for real-world personalized content with complex interactions.

Abstract

We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion (PFD) model to generate identity-consistent human-object interaction (HOI) images. While existing PFD models have advanced significantly, they often overemphasize facial features at the expense of full-body coherence. To address this, PersonaHOI introduces an additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By incorporating cross-attention constraints in the PFD branch and spatial merging at both latent and residual levels, PersonaHOI preserves personalized facial details while ensuring that non-facial regions depict the interaction. Experiments, validated by a novel interaction alignment metric, demonstrate the superior realism and scalability of PersonaHOI, establishing a new standard for practical personalized face with HOI generation. Our code will be available at https://github.com/JoyHuYY1412/PersonaHOI

Paper Structure (27 sections, 5 equations, 12 figures, 7 tables)

This paper contains 27 sections, 5 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Examples of Personalized Face with Human-Object Interaction (HOI) Generation. We present PersonaHOI, a training- and tuning-free framework built on existing diffusion models. Using a single reference image and diverse HOI prompts, PersonaHOI generates identity-consistent human-object interactions, shown here in comparison with FastComposer [54]. PersonaHOI can also seamlessly integrate varied contexts, styles, accessories, and multi-person scenarios, ensuring scalability and practicality for real-world applications.
  • Figure 2: (a) The spatial layout of StableDiffusion guides PersonaHOI to generate personalized content with coherent human-object interactions (HOI). (b) Analysis of identity injection timing in PFD models. We use FastComposer [54] for diffusion model generation. Injecting the face representation at the start of image generation preserves facial details but lacks coherent HOI, while delaying the injection progressively deviates from the original identity, resulting in random human features and meaningless human-object interactions.
  • Figure 3: Overview of Our Proposed Framework, PersonaHOI. The architecture integrates a personalized face diffusion (PFD) model with an additional StableDiffusion (SD) branch. First, SD generates an image ($I_{SD}$) from a text prompt and a noisy latent representation ($z_T$); the image is decoded and segmented to produce a head mask. Next, SD and PFD run in parallel from the same $z_T$. At every timestep $t$, the head mask guides the Cross-Attention Constraint in PFD and the merging modules (Latent Merge and Residual Merge), which merge interaction-relevant features from SD with identity-specific details from PFD. Applied iteratively, this process introduces HOI context to personalized face generation in a training- and tuning-free manner.
  • Figure 4: Illustration of Residual Merge. In each residual layer, Residual Merge operates within the U-Net skip connections, utilizing a head mask to guide the integration of high-frequency identity details from PFD residuals and low-frequency interaction layouts from SD residuals. The merged residuals are then concatenated to the corresponding bottleneck features from PFD.
  • Figure 5: Qualitative Examples of PersonaHOI and Baseline Models. Comparison of baseline models (FastComposer [54], IP-Adapter [40], PhotoMaker [28]) and their PersonaHOI-enhanced results for diverse human-object interaction prompts.
  • ...and 7 more figures
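
The mask-guided merging described in the captions for Figures 3 and 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`mask_merge`, `box_blur`, `residual_merge`), the box blur used as a stand-in low-pass filter, and the exact combination rule (SD low frequencies plus PFD high frequencies inside the head mask, SD features unchanged outside it) are all assumptions made for illustration. Real latents and residuals would be multi-channel tensors rather than the single 2-D grids used here.

```python
from typing import List

Grid = List[List[float]]  # one 2-D feature channel, H x W (illustrative only)


def mask_merge(pfd: Grid, sd: Grid, head_mask: Grid) -> Grid:
    """Take PFD values where head_mask is 1 and SD values where it is 0."""
    return [
        [m * p + (1.0 - m) * s for p, s, m in zip(p_row, s_row, m_row)]
        for p_row, s_row, m_row in zip(pfd, sd, head_mask)
    ]


def box_blur(x: Grid, radius: int = 1) -> Grid:
    """Crude low-pass filter (assumption): mean over a window clipped at borders."""
    h, w = len(x), len(x[0])
    out: Grid = []
    for i in range(h):
        row = []
        for j in range(w):
            vals = [
                x[a][b]
                for a in range(max(0, i - radius), min(h, i + radius + 1))
                for b in range(max(0, j - radius), min(w, j + radius + 1))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def residual_merge(pfd_res: Grid, sd_res: Grid, head_mask: Grid,
                   radius: int = 1) -> Grid:
    """Assumed rule: inside the mask, combine the SD low-frequency layout with
    the PFD high-frequency detail; outside the mask, keep the SD residual."""
    sd_low = box_blur(sd_res, radius)
    pfd_low = box_blur(pfd_res, radius)
    hybrid = [
        [sl + (p - pl) for sl, p, pl in zip(sl_row, p_row, pl_row)]
        for sl_row, p_row, pl_row in zip(sd_low, pfd_res, pfd_low)
    ]
    return mask_merge(hybrid, sd_res, head_mask)
```

In this sketch, `mask_merge` alone plays the role of a latent-level merge (identity inside the head region, interaction layout elsewhere), while `residual_merge` adds the frequency split mentioned in the Figure 4 caption. With an all-zero mask both functions return the SD input unchanged, which makes the behavior easy to sanity-check.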