Mitigating Semantic Collapse in Generative Personalization with Test-Time Embedding Adjustment

Anh Bui, Trang Vu, Trung Le, Junae Kim, Tamas Abraham, Rollin Omari, Amar Kaur, Dinh Phung

TL;DR

This paper proposes a simple yet effective training-free method that adjusts the magnitude and direction of the pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem in generative personalization.

Abstract

In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only reduces the semantic richness of complex input prompts like "a photo of $V$ wearing glasses and playing guitar" into simpler, less contextually rich forms such as "a photo of $V$", but also leads to simplified output images that fail to capture the intended concept. We identify the root cause as unconstrained optimization, which allows the learned embedding $V$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of the pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at https://github.com/tuananhbui89/Embedding-Adjustment

Paper Structure

This paper contains 50 sections, 12 equations, 31 figures, 7 tables.

Figures (31)

  • Figure 1: Our Test-time Embedding Adjustment (TEA) method consistently enhances text-image alignment across diverse personalization approaches (Textual Inversion, DreamBooth, and their variants) and architectures (Stable Diffusion, Flux). Notably, TEA also counteracts the anti-personalization effect of Anti-DreamBooth and restores the protected concept.
  • Figure 2: (a) The inter-set distance $d(P_{V^*}, P_c)$ and intra-set distance $d(P_{V^*}, P_{V^*})$ over the personalization process; (b) the distance between all possible pairs of sets, notably $d(P_{V^*}, P_{V^*}^{\text{simple}})$.
  • Figure 3: Analysis of the semantic collapsing problem (SCP) on Textual Inversion (TI, left) and DreamBooth (DB, right). Alignment with the ground-truth image, $S(\hat{x}, x_{gt})$ (marked $\lozenge$), increases over time, while alignment with the contextual part, $S(\hat{x}, p)$ (marked $\square$), decreases.
  • Figure 4: (left) The distribution of the norm of the token embedding $M$, including the special token $V^*$. (right) The semantic drift of $V^*$ in terms of magnitude and direction over time. The adjusted embedding $V^*_{\text{adjusted}}$ is obtained by using TEA with $\alpha = 0.2$ and $\beta = 1.5$. The same phenomenon is observed in DreamBooth, as shown in Figure \ref{fig:semantic_drift_db_lora}.
  • Figure 5: (left) The TEA framework, which adjusts the embedding at inference time; both the U-Net and the text encoder are simply the personalized pre-trained models. (right) The two stages of TEA: normalization and rotation with SLERP.
  • ...and 26 more figures
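
The two stages named in the Figure 5 caption (normalization and rotation with SLERP) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and the roles given to $\alpha$ and $\beta$ are assumptions, since the captions above state only their values. Here $\alpha$ is assumed to be the fraction of the rotation from the learned embedding toward a reference concept embedding, and $\beta$ is assumed to scale the reference embedding's norm to give the target magnitude.

```python
import math


def norm(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def rescale(v, target_norm):
    """Rescale v so that its L2 norm equals target_norm."""
    n = norm(v)
    if n == 0.0:
        raise ValueError("cannot rescale a zero vector")
    return [x * target_norm / n for x in v]


def slerp(a, b, t):
    """Spherical linear interpolation between the directions of a and b.

    Returns a unit vector: t=0 gives the direction of a, t=1 that of b.
    """
    ua, ub = rescale(a, 1.0), rescale(b, 1.0)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(ua, ub))))
    theta = math.acos(dot)
    s = math.sin(theta)
    if s < 1e-8:
        # Nearly parallel or antiparallel: the rotation is either
        # unnecessary or ill-defined, so keep the direction of a.
        return ua
    wa = math.sin((1.0 - t) * theta) / s
    wb = math.sin(t * theta) / s
    return [wa * x + wb * y for x, y in zip(ua, ub)]


def adjust_embedding(v_learned, c_ref, alpha, beta):
    """Hypothetical test-time adjustment of a learned embedding.

    Assumption: alpha is the SLERP fraction toward c_ref, and the
    target magnitude is beta * ||c_ref||. Neither is confirmed by the
    source text.
    """
    direction = slerp(v_learned, c_ref, alpha)  # rotation stage
    target = beta * norm(c_ref)                 # normalization stage
    return [x * target for x in direction]
```

Under these assumptions, `adjust_embedding(v, c, alpha=0.2, beta=1.5)` would rotate a drifted embedding `v` a fifth of the way toward the reference `c` and set its norm to 1.5 times that of `c`, with no retraining of the U-Net or the text encoder.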