MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation

Kuan-Chieh Wang, Daniil Ostashev, Yuwei Fang, Sergey Tulyakov, Kfir Aberman

TL;DR

MoA introduces a dual-attention framework for personalized image generation that preserves the prior model while learning a subject-specific branch. A router blends outputs from a fixed, prior-attention path and a trainable personalization path, enabling disentangled subject-context control and robust multi-subject interactions without layout constraints. The method uses multimodal prompts and layer-wise routing to maintain background fidelity while injecting subject information, and it remains compatible with existing diffusion techniques like ControlNet and DDIM Inversion. Empirical results demonstrate strong subject-context disentanglement, high image quality, and versatile applications such as subject morphing and real-image subject swapping, with limitations acknowledged on facial detail and complex scenes.

Abstract

We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts mechanism utilized in large language models (LLMs), MoA distributes the generation workload between two attention pathways: a personalized branch and a non-personalized prior branch. MoA is designed to retain the original model's prior by fixing its attention layers in the prior branch, while minimally intervening in the generation process with the personalized branch that learns to embed subjects in the layout and context generated by the prior branch. A novel routing mechanism manages the distribution of pixels in each layer across these branches to optimize the blend of personalized and generic content creation. Once trained, MoA facilitates the creation of high-quality, personalized images featuring multiple subjects with compositions and interactions as diverse as those generated by the original model. Crucially, MoA enhances the distinction between the model's pre-existing capability and the newly augmented personalized intervention, thereby offering a more disentangled subject-context control that was previously unattainable. Project page: https://snap-research.github.io/mixture-of-attention

Paper Structure (42 sections, 8 equations, 22 figures, 3 tables)

Figures (22)

  • Figure 1: Mixture-of-Attention. Unlike the standard attention mechanism (left), MoA has two attention pathways: a trainable personalized attention branch and a fixed, non-personalized attention branch copied from the original model (prior attention). In addition, a routing mechanism manages the distribution of pixels in each layer across these branches to optimize the blend of personalized and generic content creation.
  • Figure 2: Comparing image variations. In contrast to FastComposer (Xiao et al., 2023), our method (MoA) generates images with diverse compositions and fosters interaction between the subject and what is described in the text prompt.
  • Figure 3: Text-to-Image Diffusion Models with MoA. Our architecture expands the original diffusion U-Net by replacing each attention block (self and cross) with MoA. In each inference step, an MoA block receives the input image features and passes them to the router, which decides how to weight the output of the personalized attention against the output of the original attention block. Note that the images of the subjects are injected only through the personalized attention branch; because the router is encouraged during training to prioritize the prior branch, only the minimal information required to generate the subjects is transferred through the personalized attention.
  • Figure 4: Multimodal prompts. Our architecture enables us to inject images as visual tokens that are part of the text prompt, where each visual token is attached to a text encoding of a specific token.
  • Figure 5: Router Visualization. Our router learns to generate soft segmentation maps per time step in the diffusion process and per layer. Distinct parts of the subjects, in different resolutions, are highlighted across various time steps and layers.
  • ...and 17 more figures
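
The blending described in the Figure 1 and Figure 3 captions can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`attention`, `moa_layer`), the single-head attention, the linear router `w_router`, and all tensor shapes are assumptions made for illustration. The source only states that a router assigns per-pixel weights to a fixed prior branch and a trainable personalized branch in each layer.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x, context, w_q, w_k, w_v):
    # Single-head scaled dot-product attention.
    # x: (pixels, d), context: (tokens, d_ctx).
    q = x @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def moa_layer(x, ctx_prior, ctx_personal, prior_w, personal_w, w_router):
    # Prior branch: weights copied from the original model and kept fixed.
    out_prior = attention(x, ctx_prior, *prior_w)
    # Personalized branch: trainable; the only path that sees subject images.
    out_personal = attention(x, ctx_personal, *personal_w)
    # Hypothetical linear router: per-pixel soft weights over the two branches.
    weights = softmax(x @ w_router, axis=-1)  # (pixels, 2)
    blended = weights[:, :1] * out_prior + weights[:, 1:] * out_personal
    return blended, weights
```

In this sketch, each row of `weights` sums to 1, so a pixel whose router weight on the prior branch is close to 1 is generated almost entirely by the original model. Reshaped to the feature-map resolution, the personalized-branch column of `weights` plays the role of the soft segmentation map mentioned in the Figure 5 caption.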