Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

Saman Motamed, Danda Pani Paudel, Luc Van Gool

TL;DR

Lego is a textual inversion method for inverting subject-entangled concepts from a few example images. It disentangles concepts from their associated subjects with a simple yet effective Subject Separation step and uses a Context Loss to guide the inversion of single- and multi-embedding concepts.

Abstract

Text-to-Image (T2I) models excel at synthesizing concepts such as nouns, appearances, and styles. To enable customized content creation based on a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the desired concept and enable synthesizing it in new scenes. However, inverting personalized concepts that go beyond object appearance and style (adjectives and verbs) through natural language remains a challenge. Two key characteristics of these concepts contribute to the limitations of current inversion methods. 1) Adjectives and verbs are entangled with nouns (subject) and can hinder appearance-based inversion methods, where the subject appearance leaks into the concept embedding, and 2) describing such concepts often extends beyond single word embeddings. In this study, we introduce Lego, a textual inversion method designed to invert subject-entangled concepts from a few example images. Lego disentangles concepts from their associated subjects using a simple yet effective Subject Separation step and employs a Context Loss that guides the inversion of single/multi-embedding concepts. In a thorough user study, Lego-generated concepts were preferred over 70% of the time when compared to the baseline in terms of authentically generating concepts according to a reference. Additionally, visual question answering using an LLM suggested Lego-generated concepts are better aligned with the text description of the concept.

Paper Structure (29 sections, 3 equations, 33 figures, 2 tables)

Figures (33)

  • Figure 1: We showcase Lego's ability to invert the concepts "frozen in ice", "burnt and melted", and "closed eyes" using as few as four example images (two with and two without the concept). Our results cover text-to-image models, including LDM, Stable Diffusion 2.1, Attend and Excite, and closed-source DALL.E 2. Notably, Lego faithfully represents intended personalized concepts, even with a less capable backbone (LDM), while more powerful models such as DALL.E, though artistically impressive, do not consistently capture the same.
  • Figure 2: A) We showcase our definition of personalized concept inversion. While SD 2.1 and DALL.E 2 and 3 create their own version of a "frozen Lego horse in ice", we are interested not only in synthesizing the concept, but also in doing so such that it follows the example concept of the reference image (personalized), where the concept has unique characteristics (e.g., cracks and trapped bubbles in the ice). B) We visualize 4 concepts when using LDM with a text description of the concept (bottom row) compared to visualizing the concepts after performing Lego inversion using reference images (visualized at the bottom of each Lego-generated image) of the concept (top row).
  • Figure 3: Textual Inversion is not able to learn the concept of "closed eyes" from multiple subjects without the appearance of the sample subjects leaking into the concept embedding.
  • Figure 4: The right figure is an overview of Lego's objective and the Subject Separation step. Learning an explicit embedding to represent the subject (Rubik's cube) allows the concept ("melted") embedding to dissociate from the subject's appearance features, as visualized by the $<$concept$>$ embedding (highlighted in blue). The left figure depicts the framework that uses concept-only images (same setting as TI, DreamBooth, etc.). In this setting, the subject's features leak into the $<$concept$>$ embedding (highlighted in orange and blue), as shown by the concept's visualization, which evinces both melting effects and Rubik's cube features.
  • Figure 5: An overview of Lego's framework. From left to right, during embedding optimization, Lego dedicates an embedding $<\!subj\!>$ to inverting the subject $\mathbf{S}_e$ in the exemplar images $\mathcal{I}_C$ and $\mathcal{I}_{\overline{C}}$. This stops appearance leakage into the concept embeddings. Each concept embedding ($<\!cpt_{i/j}\!>$) is separately steered towards user-defined words ($\mathcal{P}_{i/j}$) that correspond to the embedding's semantic word and away from antonyms of those words ($\mathcal{N}_{i/j}$). After the inversion, the learned embeddings can together be applied to different target subjects $\mathbf{S}_t$ ("Statue" and "Teddy bear") to manifest the concept in new scenes.
  • ...and 28 more figures
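
The steering described in the Figure 5 caption, where a concept embedding is pulled towards user-defined words and pushed away from their antonyms, can be illustrated with a toy sketch. This is not the paper's Context Loss equation, which does not appear in this section. The function name `context_loss_sketch`, the use of cosine similarity, the equal weighting of the two terms, and the 3-dimensional toy vectors are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def context_loss_sketch(concept, positives, negatives):
    """Hypothetical steering term (an assumption, not the paper's formula).

    The value is lower when `concept` is closer to the positive word
    embeddings and farther from the negative (antonym) embeddings.
    """
    pull = sum(cosine(concept, p) for p in positives) / len(positives)
    push = sum(cosine(concept, n) for n in negatives) / len(negatives)
    return push - pull


# Toy stand-ins for word embeddings; real text-encoder embeddings
# have far more dimensions.
melted = [1.0, 0.0, 0.0]   # plays the role of a positive word
solid = [-1.0, 0.0, 0.0]   # plays the role of its antonym

near = [0.9, 0.1, 0.0]     # concept embedding close to "melted"
far = [-0.8, 0.2, 0.0]     # concept embedding close to "solid"

loss_near = context_loss_sketch(near, [melted], [solid])
loss_far = context_loss_sketch(far, [melted], [solid])
```

In this sketch an embedding near the positive word receives a lower value than one near the antonym, so minimizing the term with respect to the concept embedding would move it in the direction the caption describes.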