AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas

Longhui Yuan

Abstract

Multi-person identity-preserving generation requires binding multiple reference faces to specified locations under a text prompt. Strong identity and layout conditions often trigger copy-paste shortcuts and weaken prompt-driven controllability. We present AnyPhoto, a diffusion-transformer fine-tuning framework with (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding, (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection, and (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss, together with reference-face replacement and location-canvas degradations to discourage shortcuts. On MultiID-Bench, AnyPhoto improves identity similarity while reducing copy-paste tendency, with gains increasing as the number of identities grows. AnyPhoto also supports prompt-driven stylization with accurate placement.
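To make the AdaLN-style identity-adaptive modulation more concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes each image token carries an identity index (or -1 if it belongs to no face region), and that a scale and shift are obtained from each face-recognition embedding by a linear projection. The function names, the projection matrices `w_scale` and `w_shift`, and the token-to-identity assignment are all hypothetical choices made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def identity_adaptive_modulation(tokens, token_ids, id_embeds, w_scale, w_shift):
    """AdaLN-style modulation driven by per-identity embeddings (illustrative).

    tokens:    (N, D) token features
    token_ids: (N,) identity index per token, -1 for tokens with no identity
    id_embeds: (K, E) one face-recognition embedding per identity
    w_scale, w_shift: (E, D) hypothetical projections to scale and shift
    """
    scale = id_embeds @ w_scale          # (K, D) per-identity scale
    shift = id_embeds @ w_shift          # (K, D) per-identity shift
    normed = layer_norm(tokens)
    out = normed.copy()
    has_id = token_ids >= 0              # only modulate tokens tied to a face
    ids = token_ids[has_id]
    out[has_id] = normed[has_id] * (1.0 + scale[ids]) + shift[ids]
    return out

# Toy usage: 6 tokens, 2 identities, tokens 2 and 5 belong to no identity.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
token_ids = np.array([0, 0, -1, 1, 1, -1])
id_embeds = rng.normal(size=(2, 4))
w_scale = 0.1 * rng.normal(size=(4, 8))
w_shift = 0.1 * rng.normal(size=(4, 8))
out = identity_adaptive_modulation(tokens, token_ids, id_embeds, w_scale, w_shift)
```

In this sketch, tokens outside any face region pass through plain layer normalization, while tokens inside a region are scaled and shifted using only their own identity's embedding. With zero projections the modulation reduces to the identity on normalized tokens, which is the usual reason for the `1 + scale` form.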

Paper Structure (54 sections, 33 equations, 9 figures, 4 tables)

Introduction
Preliminaries
Flow Matching
Modulation in DiTs
AnyPhoto
Problem Definition
Location-Aligned Token Pruning
Identity-Adaptive Modulation
Identity-Isolated Attention
AnyPhoto Training
Avoiding copy-paste collapse.
Face similarity loss.
Overall objective.
Experiments
Experimental Setup
...and 39 more sections

Figures (9)

Figure 1: AnyPhoto is capable of generating identity-preserving, location-controlled, high-quality images conditioned on reference faces and a text prompt.
Figure 2: AnyPhoto. Location-Aligned Token Pruning constructs the input from the text prompt, noised image, and location reference. Reference face embeddings modulate the aligned tokens. Training uses conditional flow matching and face similarity losses.
Figure 3: Detailed illustration of the components of AnyPhoto.
Figure 4: Visual comparisons of AnyPhoto with baselines conditioned on 1/2/3/4 persons.
Figure 5: Qualitative Ablations. The first row shows generations without a specified style, while the second row demonstrates outputs conditioned on a "Line Art" style.
...and 4 more figures
