AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas

Longhui Yuan

Abstract

Multi-person identity-preserving generation requires binding multiple reference faces to specified locations under a text prompt. Strong identity/layout conditions often trigger copy-paste shortcuts and weaken prompt-driven controllability. We present AnyPhoto, a diffusion-transformer finetuning framework with (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding, (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection, and (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss, together with reference-face replacement and location-canvas degradations to discourage shortcuts. On MultiID-Bench, AnyPhoto improves identity similarity while reducing copy-paste tendency, with gains increasing as the number of identities grows. AnyPhoto also supports prompt-driven stylization with accurate placement.
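
To make the "AdaLN-style identity-adaptive modulation" in (ii) concrete, the following is a minimal sketch of a generic adaptive layer-norm step applied to a single token. It assumes the common formulation in which a normalized token is rescaled by (1 + scale) and then shifted. In AnyPhoto the scale and shift would be derived from a face-recognition embedding, but that projection is not specified in this excerpt, so here they are passed in directly. The names `layer_norm` and `adaln_modulate` are hypothetical and do not come from the paper.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize one token vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_modulate(
    token: List[float], scale: List[float], shift: List[float]
) -> List[float]:
    """Generic AdaLN-style modulation: LN(token) * (1 + scale) + shift.

    Assumption: scale and shift stand in for the output of a learned
    projection of an identity embedding; that projection is omitted here.
    """
    normed = layer_norm(token)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]
```

With zero scale and shift the function reduces to plain layer normalization, so the identity signal enters only through the modulation parameters.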

Paper Structure (54 sections, 33 equations, 9 figures, 4 tables)


Figures (9)

  • Figure 1: AnyPhoto is capable of generating identity-preserving, location-controlled, high-quality images conditioned on reference faces and a text prompt.
  • Figure 2: AnyPhoto. Location-Aligned Token Pruning constructs the input from the text prompt, noised image, and location reference. Reference face embeddings modulate the aligned tokens. Training uses conditional flow matching and face similarity losses.
  • Figure 3: Detailed illustration of the components of AnyPhoto.
  • Figure 4: Visual comparisons of AnyPhoto with baselines conditioned on 1/2/3/4 persons.
  • Figure 5: Qualitative Ablations. The first row shows generations without a specified style, while the second row demonstrates outputs conditioned on a "Line Art" style.
  • ...and 4 more figures
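
The Figure 2 caption states that training uses conditional flow matching and face similarity losses. A rough sketch of how such a combined objective could look is given below. It rests on several assumptions not confirmed by this excerpt: a linear interpolation path between noise `x0` and data `x1`, a velocity target of `x1 - x0`, a face similarity term of one minus cosine similarity between non-zero embeddings, and a hypothetical weighting factor `weight`. All function names are illustrative.

```python
import math
from typing import List


def interpolate(x0: List[float], x1: List[float], t: float) -> List[float]:
    """Point on the assumed linear path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_matching_loss(
    pred_velocity: List[float], x0: List[float], x1: List[float]
) -> float:
    """Mean squared error between predicted velocity and the target x1 - x0."""
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - v) ** 2 for p, v in zip(pred_velocity, target)) / len(target)


def face_similarity_loss(gen_emb: List[float], ref_emb: List[float]) -> float:
    """One minus cosine similarity; assumes both embeddings are non-zero."""
    dot = sum(g * r for g, r in zip(gen_emb, ref_emb))
    norm = math.sqrt(sum(g * g for g in gen_emb)) * math.sqrt(
        sum(r * r for r in ref_emb)
    )
    return 1.0 - dot / norm


def total_loss(
    pred_velocity: List[float],
    x0: List[float],
    x1: List[float],
    gen_emb: List[float],
    ref_emb: List[float],
    weight: float = 0.1,  # hypothetical weighting, not taken from the paper
) -> float:
    """Flow matching term plus a weighted face similarity term."""
    return flow_matching_loss(pred_velocity, x0, x1) + weight * face_similarity_loss(
        gen_emb, ref_emb
    )
```

In this sketch the face similarity term acts in embedding space only, matching the abstract's description of an "embedding-space face similarity loss"; how the generated-face embedding is obtained during training is left out.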