AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation

Junjie He, Yuxiang Tuo, Binghui Chen, Chongyang Zhong, Yifeng Geng, Liefeng Bo

TL;DR

AnyStory unifies single- and multi-subject personalization for text-to-image generation in an encode-then-route framework. Subjects are encoded by a universal ReferenceNet-based encoder together with a CLIP-based encoding path. A decoupled instance-aware router, modeled as a lightweight segmentation decoder, then localizes each subject's conditioning in the latent space, which gives flexible and precise control over multiple subjects and their interactions with the background. Training proceeds in two stages (subject encoder, then router) over large paired and unpaired datasets. Experiments show improved subject detail fidelity, text alignment, and robust multi-subject generation, which makes the method practical for narrative image synthesis. Two limitations remain: backgrounds cannot be personalized, and copy-paste artifacts still occur. Future work aims to extend control to backgrounds and further reduce these artifacts.

Abstract

Recently, large-scale generative models have demonstrated outstanding text-to-image generation capabilities. However, generating high-fidelity personalized images with specific subjects still presents challenges, especially in cases involving multiple subjects. In this paper, we propose AnyStory, a unified approach for personalized subject generation. AnyStory achieves high-fidelity personalization not only for single subjects but also for multiple subjects, without sacrificing subject fidelity. Specifically, AnyStory models the subject personalization problem in an "encode-then-route" manner. In the encoding step, AnyStory utilizes a universal and powerful image encoder, i.e., ReferenceNet, in conjunction with a CLIP vision encoder to achieve high-fidelity encoding of subject features. In the routing step, AnyStory utilizes a decoupled instance-aware subject router to accurately perceive and predict the potential location of the corresponding subject in the latent space, and to guide the injection of subject conditions. Detailed experimental results demonstrate the excellent performance of our method in retaining subject details, aligning text descriptions, and personalizing for multiple subjects. The project page is at https://aigcdesigngroup.github.io/AnyStory/ .
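
The "encode-then-route" idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `route_and_inject`, the hard argmax routing, the trailing background slot, and the plain residual cross-attention are all assumptions made for illustration. The sketch only shows the general pattern in which each subject's encoded tokens influence just the latent positions that a router assigns to that subject.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def route_and_inject(latent, subject_tokens, router_logits):
    """Hypothetical routed condition injection (illustrative only).

    latent:         (N, d) latent features at N spatial positions
    subject_tokens: list of S arrays, each (M_i, d), one per encoded subject
    router_logits:  (S + 1, N) scores; the last row is an assumed background slot
    """
    # Assumption: hard routing, each position goes to its highest-scoring slot.
    assignment = router_logits.argmax(axis=0)  # shape (N,)
    out = latent.copy()
    d = latent.shape[1]
    for i, tokens in enumerate(subject_tokens):
        mask = assignment == i
        if not mask.any():
            continue
        # Cross-attention from the routed positions to this subject's tokens.
        queries = latent[mask]
        attn = softmax(queries @ tokens.T / np.sqrt(d), axis=-1)
        # Residual injection, restricted to the routed positions.
        out[mask] += attn @ tokens
    return out, assignment
```

Positions routed to the background slot are left untouched, and the tokens of one subject never reach positions assigned to another subject, which is the property that prevents features from blending across subjects in this toy setting.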

Paper Structure (14 sections, 8 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Example generations I from AnyStory. Our approach demonstrates excellence in preserving subject details, aligning text descriptions, and personalizing multiple subjects. Here, the image with a plain white background serves as the reference. For more examples, please refer to Fig. \ref{fig:example-2} and Fig. \ref{fig:example-3}.
  • Figure 2: Overview of AnyStory framework. AnyStory follows the "encode-then-route" conditional generation paradigm. It first utilizes a simplified ReferenceNet combined with a CLIP vision encoder to encode the subject (Sec. \ref{sec:3.2}), and then employs a decoupled instance-aware subject router to guide the subject condition injection (Sec. \ref{sec:3.3}). The training process is divided into two stages: the subject encoder training stage and the router training stage (Sec. \ref{sec:3.4}). For brevity, we omit the text conditional branch here.
  • Figure 3: Effect of ReferenceNet encoding. The ReferenceNet encoder enhances the preservation of subject details.
  • Figure 4: The effectiveness of the router. The router restricts the influence areas of the subject conditions, thereby avoiding the blending of characteristics between multiple subjects and improving the quality of the generated images.
  • Figure 5: Visualization of routing maps. We visualize the routing maps within each cross-attention layer in the U-Net at different diffusion time steps. There are a total of 70 cross-attention layers in the SDXL U-Net, and we display them sequentially in each subfigure, in top-to-bottom and left-to-right order (yellow represents the effective region). We utilize $T=25$ steps of EDM sampling. Each complete row corresponds to one entity. The background routing map, which is the complement of the routing maps of all subjects, is omitted. Best viewed in color and zoomed in.
  • ...and 3 more figures
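
The Figure 5 caption describes the background routing map as the complement of the routing maps of all subjects. A minimal sketch of that relationship follows, assuming binary (boolean) routing maps stacked per subject. The helper name `background_map` and the array layout are hypothetical and are not taken from the paper.

```python
import numpy as np


def background_map(subject_maps):
    """Return the background routing map for a stack of subject maps.

    subject_maps: (S, H, W) boolean array, True where a subject is routed.
    The background is every position that no subject claims.
    """
    return ~np.any(subject_maps, axis=0)


# Two hypothetical subjects on a 2x3 latent grid.
maps = np.zeros((2, 2, 3), dtype=bool)
maps[0, 0, 0] = True  # subject 0 occupies the top-left position
maps[1, 1, 2] = True  # subject 1 occupies the bottom-right position
background = background_map(maps)
```

By construction, the background map and the union of the subject maps partition the grid, so every latent position belongs either to at least one subject or to the background.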