Nested Attention: Semantic-aware Attention Values for Concept Personalization

Or Patashnik, Rinon Gal, Daniil Ostashev, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or

TL;DR

This work introduces Nested Attention, a mechanism that injects a rich and expressive image representation into a text-to-image model's existing cross-attention layers. Nested attention layers learn to select the relevant subject features for each region of the generated image, producing query-dependent subject values.

Abstract

Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model's prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.

Paper Structure

This paper contains 33 sections, 4 equations, 20 figures, and 1 table.

Figures (20)

  • Figure 1: Our nested attention mechanism attaches a localized, expressive representation of a subject to a single text token. This approach improves identity preservation while maintaining the model's prior, and can combine multiple personalized concepts in a single image.
  • Figure 2: Method overview. The input image is passed through an encoder that produces multiple tokens to represent it. These tokens are projected to form the keys and values of the nested attention layers. The result of each nested attention layer is a new set of per-query values, $V_q^*$, which then replace the cross-attention values of the token $s^*$ representing the subject. One nested attention layer is added to each of the cross-attention layers of the model.
  • Figure 3: The nested attention mechanism. We replace the value of the token $s^*$ with the result of an attention operation between the query and the nested keys and values produced by the encoder, resulting in a query-dependent value.
  • Figure 4: We visualize the values $V_q[s^*]$ generated for a subject in two different layers, with a vanilla cross-attention, and with our nested approach. Vanilla layers use the same value to represent the subject throughout the entire image (column 3). Nested attention assigns a different subject-value per query (columns 4 and 5), encoding fine-grained semantic information.
  • Figure 5: Analyzing the query-dependent values ($V_q[s^*]$) from a nested attention layer. For three queries of the generated image (purple, orange, blue points), we first show their attention maps in a nested attention layer (graph). There, each point corresponds to a token produced by the encoder. In each graph, 1-2 tokens dominate the attention. To analyze the information encoded in the most dominant token, we show the Q-Former attention map of its corresponding learned query. These show the semantic alignment between the probed query, and the source of values assigned to it.
  • ...and 15 more figures
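
The mechanism described in the captions of Figures 2 and 3 can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a single attention head, a shared feature dimension d for all projections, and standard scaled dot-product attention in both the outer and the nested layer. The function name `nested_cross_attention` and all argument names are hypothetical. The sketch only shows how a per-query value $V_q^*$ could replace the value of the subject token $s^*$ before the usual cross-attention output is computed.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def nested_cross_attention(Q, K, V, subject_idx, nested_K, nested_V):
    """Illustrative single-head cross-attention with a nested layer.

    Q:           (n_queries, d) image-region queries
    K, V:        (n_tokens, d)  text-prompt keys and values
    subject_idx: position of the subject token s* in the prompt
    nested_K/V:  (n_enc, d)     keys/values projected from encoder tokens
    Returns the attention output (n_queries, d) and the per-query
    subject values V_q* (n_queries, d).
    """
    n_queries, d = Q.shape
    scale = 1.0 / np.sqrt(d)  # assumption: scaled dot-product attention

    # Nested attention: each query attends over the encoder tokens,
    # yielding one subject value per query instead of one shared value.
    nested_weights = softmax(Q @ nested_K.T * scale)  # (n_queries, n_enc)
    V_star = nested_weights @ nested_V                # (n_queries, d)

    # Outer cross-attention weights over the prompt tokens (unchanged).
    weights = softmax(Q @ K.T * scale)                # (n_queries, n_tokens)

    # Build per-query values: copy V for every query, then overwrite
    # the row of s* with that query's nested result.
    V_q = np.broadcast_to(V, (n_queries,) + V.shape).copy()
    V_q[:, subject_idx, :] = V_star

    # Weighted sum over tokens, using each query's own value set.
    out = np.einsum("qt,qtd->qd", weights, V_q)
    return out, V_star


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(6, 8))         # 6 image queries
    K = rng.normal(size=(4, 8))         # 4 prompt tokens
    V = rng.normal(size=(4, 8))
    nested_K = rng.normal(size=(5, 8))  # 5 encoder tokens
    nested_V = rng.normal(size=(5, 8))
    out, V_star = nested_cross_attention(Q, K, V, 2, nested_K, nested_V)
    print(out.shape, V_star.shape)
```

In this sketch the attention weights over the prompt are untouched and only the value of $s^*$ varies per query. If every nested value equals the original value of $s^*$, the output reduces to vanilla cross-attention, which gives a simple sanity check on the construction.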