
Implicit Style-Content Separation using B-LoRA

Yarden Frenkel, Yael Vinker, Ariel Shamir, Daniel Cohen-Or

TL;DR

This work tackles the challenge of disentangling style and content from a single image to enable flexible image stylization. It introduces B-LoRA, a lightweight approach that jointly trains low-rank adapters for two SDXL transformer blocks ($W_0^4$ and $W_0^5$) to implicitly separate content and style, achieving high-quality style transfer, text-guided stylization, and consistent style generation without heavy fine-tuning. The method reduces training and storage requirements and enables reusability of learned components across tasks, outperforming several baselines in both qualitative and quantitative evaluations, including a user study. The findings suggest a robust, efficient pathway for per-image style-content decomposition with potential extensions to more granular sub-components and multi-image personalization.

Abstract

Image stylization involves manipulating the visual appearance and texture (style) of an image while preserving its underlying objects, structures, and concepts (content). The separation of style and content is essential for manipulating the image's style independently from its content, ensuring a harmonious and visually pleasing result. Achieving this separation requires a deep understanding of both the visual and semantic characteristics of images, often necessitating the training of specialized models or employing heavy optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA (Low-Rank Adaptation) to implicitly separate the style and content components of a single image, facilitating various image stylization tasks. By analyzing the architecture of SDXL combined with LoRA, we find that jointly learning the LoRA weights of two specific blocks (referred to as B-LoRAs) achieves style-content separation that cannot be achieved by training each B-LoRA independently. Consolidating the training into only two blocks and separating style and content allows for significantly improving style manipulation and overcoming overfitting issues often associated with model fine-tuning. Once trained, the two B-LoRAs can be used as independent components to allow various image stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing.
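
To make the composition described above concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the standard LoRA form $W = W_0 + \Delta W$ with $\Delta W = BA$ of low rank, and treats each block as a single toy matrix. The names `make_b_lora`, `delta`, and `adapt`, the sizes `d` and `r`, and the random "trained" factors are all hypothetical stand-ins; in the actual method the two adapters would be optimized jointly on a single image.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # toy hidden size and LoRA rank (illustrative values, not from the paper)

# Frozen base weights: toy stand-ins for the two blocks W_0^4 and W_0^5.
base = {4: rng.normal(size=(d, d)), 5: rng.normal(size=(d, d))}

def make_b_lora(seed):
    """Hypothetical helper: random low-rank factors (B, A) for blocks 4 and 5.

    In practice these factors would come from jointly training both blocks
    on one image; random values are used here only to show the bookkeeping.
    """
    g = np.random.default_rng(seed)
    return {blk: (g.normal(size=(d, r)), g.normal(size=(r, d))) for blk in (4, 5)}

def delta(b_lora, blk):
    """Low-rank update Delta W = B @ A for one block."""
    B, A = b_lora[blk]
    return B @ A

def adapt(base_weights, updates):
    """Return W = W_0 + Delta W for listed blocks; other blocks stay frozen."""
    return {blk: W0 + updates.get(blk, 0.0) for blk, W0 in base_weights.items()}

content_img = make_b_lora(1)  # B-LoRAs "trained" on the content image
style_img = make_b_lora(2)    # B-LoRAs "trained" on the style image

# Image style transfer: content update on block 4, style update on block 5.
stylized = adapt(base, {4: delta(content_img, 4), 5: delta(style_img, 5)})

# Text-based stylization: only the content update; block 5 keeps its base weights.
text_stylized = adapt(base, {4: delta(content_img, 4)})
```

The point of the sketch is only that, once trained, the two updates are separate objects that can be applied together, individually, or mixed across images without retraining.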

Paper Structure (35 sections, 2 equations, 30 figures, 2 tables)

This paper contains 35 sections, 2 equations, 30 figures, 2 tables.

Figures (30)

  • Figure 1: Examples of image stylization generated with our approach. The content image is shown on the left. We show here three results of image style transfer based on a reference style, one (on the right) based on a guiding text prompt. Note that our method requires only a single image, and preserves the image's content and structure well while applying the desired style.
  • Figure 2: Illustration of the SDXL architecture and our text-based analysis. To examine the effect of the $i$-th transformer block on the generated image, we inject a different text prompt $\hat{p}$ into it, while $p$ is injected into all other blocks.
  • Figure 3: Prompt injection effect on the generated image. On the left, we demonstrate how blocks 2 and 4 affect the content of the generated image (turning it into a tiger), whereas the rightmost image shows that injecting $\hat{p}$ into a block $i \neq 2,4$ has no effect on the generated image. On the right, we show how the fifth block controls the generated image's color.
  • Figure 4: Comparison of training B-LoRAs for the input images shown on the left for $W_0^2,W_0^5$ (middle) and $W_0^4,W_0^5$ (right). For each pair of trained LoRA weights, we show the results of applying both together (to reconstruct the input image) and of applying the content layer separately (i.e., using only $\Delta W^2$ or $\Delta W^4$, respectively). The results demonstrate that $\Delta W^4$ better captures the fine details of the input object.
  • Figure 5: B-LoRA for Image Stylization. (1) To stylize a given content image $I_c$ with respect to a given style reference image $I_s$, we train our B-LoRAs for both images and then combine $\Delta W_c^4$ and $\Delta W_s^5$ into a single adapted model. (2) For text-based stylization, we plug in only the trained $\Delta W_c^4$ to adapt the model and then use the desired text prompt during inference. (3) The learned style weights $\Delta W_c^5$ can also be used as is to adjust the backbone model to produce images with the style of $I_c$.
  • ...and 25 more figures