Customizing Text-to-Image Models with a Single Image Pair

Maxwell Jones; Sheng-Yu Wang; Nupur Kumari; David Bau; Jun-Yan Zhu

TL;DR

Pair Customization is a new customization method that learns the stylistic difference from a single image pair and applies the acquired style during generation. It learns style effectively while avoiding overfitting to image content.

Abstract

Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair.
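The idea of keeping style and content in separate, orthogonal LoRA weight spaces can be illustrated with a small sketch. It assumes the usual LoRA parameterization, where a weight update is a product `B @ A` and `A` has shape `(rank, dim)`, and it reads "orthogonality" as orthogonal row spaces of the two `A` matrices. The names `orthonormal_rows`, `matmul_abt`, `A_content`, and `A_style` are hypothetical, and the abstract does not say how the authors construct or enforce these matrices, so this is only one simple way to obtain such a pair.

```python
import random


def orthonormal_rows(n_rows, dim, seed=0):
    """Return n_rows mutually orthonormal vectors of length dim.

    Uses modified Gram-Schmidt on random Gaussian vectors. This is an
    illustrative construction, not the authors' procedure.
    """
    rng = random.Random(seed)
    basis = []
    while len(basis) < n_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the component of v along every vector already in the basis.
        for b in basis:
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return basis


def matmul_abt(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


rank, dim = 4, 16  # assumed toy sizes
basis = orthonormal_rows(2 * rank, dim)

# Split one orthonormal basis into two disjoint sets of rows, so the
# content and style down-projections span orthogonal subspaces.
A_content, A_style = basis[:rank], basis[rank:]

# Every entry of A_content @ A_style^T should be numerically zero.
cross = matmul_abt(A_content, A_style)
max_overlap = max(abs(x) for row in cross for x in row)
print(max_overlap)
```

Because the two sets of rows come from one orthonormal basis, an input direction read by the content adapter is not read by the style adapter, which is one intuition for why orthogonality could reduce leakage of content into the learned style.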

Paper Structure (15 sections, 17 equations, 21 figures, 1 table)

Introduction
Related Works
Method
Preliminary: Model Customization
Style Extraction from an image pair
Style Guidance
Experiments
Dataset
Baselines and Evaluation Metrics
Results
Discussion and Limitations
More Quantitative and Qualitative Results
Style Guidance Details
Real Image Editing Details
Implementation Details

Figures (21)

Figure 1: Given a single image pair, we present Pair Customization, a method for customizing a pre-trained text-to-image model and learning a new style from the image pair's stylistic difference. Our method can apply the learned stylistic difference to new input images while preserving the input structure. Compared to Dreambooth LoRA [hu2022lora, loraimplementation], a standard customization method that solely uses style images, our method effectively disentangles style and content, resulting in better structure, color preservation, and style application. Style image credit: https://www.instagram.com/parkhouse_art/.
Figure 2: Method overview. (Left) We disentangle style and content from an image pair by jointly training two low-rank adapters, StyleLoRA and ContentLoRA, representing style and content, respectively. Our training objective consists of two losses. The first loss fine-tunes ContentLoRA to reconstruct the content image conditioned on a content prompt. The second loss encourages reconstructing the style image using both StyleLoRA and ContentLoRA conditioned on a style prompt, but we only optimize StyleLoRA for this loss. (Right) At inference time, we only apply StyleLoRA to customize the model. Given the same noise seed, the customized model generates a stylized counterpart of the original pre-trained model output. V* is a fixed random rare token that is a prompt modifier for the content image. Style image credit: https://www.instagram.com/parkhouse_art/.
Figure 3: Style guidance. We compare our style guidance with standard LoRA weight scaling [dreamboothlora]. Style guidance better preserves content when the style is applied. Blue and green stand for the LoRA weight scale and style guidance scale, respectively. More details of the style guidance formulation are in Section \ref{sec:cfgstyle}.
Figure 4: Orthogonal adaptation. Enforcing row-space orthogonality between the style and content LoRAs improves image quality: the images capture the style better and have fewer visual artifacts.
Figure 5: Results of our method compared to the strongest baselines. When only training with the style image, as in DB LoRA, the image structure is not preserved, and overfitting occurs. While Concept Slider's training scheme [gandikota2023concept] uses both style and content images, it still exhibits overfitting and loss of structure in many cases. Our method preserves the structure of the input image while faithfully applying the desired style. We use style guidance strength $3$ and classifier guidance strength $5$. Style image credits: https://www.instagram.com/parkhouse_art/ (first row) and https://www.instagram.com/aaronhertzmann/?hl=en (second row).
...and 16 more figures
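The Figure 3 and Figure 5 captions above mention a style guidance scale that is applied alongside the usual guidance strength. The exact formulation is given in the paper's style guidance section and is not reproduced on this page, so the sketch below is only one plausible form, stated as an assumption: a classifier-free-guidance-style combination with an extra term that pushes the prediction from the base model toward the StyleLoRA model. The function names and the toy noise vectors are hypothetical; the default scales 5 and 3 are taken from the Figure 5 caption, on the assumption that its "classifier guidance strength" acts as a classifier-free guidance scale.

```python
def style_guided_noise(eps_uncond, eps_text, eps_style,
                       cfg_scale=5.0, style_scale=3.0):
    """Combine three noise predictions into one guided prediction.

    ASSUMED form, not confirmed by the source:
        eps = eps_uncond
              + cfg_scale   * (eps_text  - eps_uncond)
              + style_scale * (eps_style - eps_text)
    eps_uncond: base model, empty prompt
    eps_text:   base model, text prompt
    eps_style:  model with StyleLoRA applied, style prompt
    """
    return [
        u + cfg_scale * (t - u) + style_scale * (s - t)
        for u, t, s in zip(eps_uncond, eps_text, eps_style)
    ]


def lora_scaled_weight(w_base, w_delta, lora_scale):
    """Standard LoRA weight scaling, for contrast: W = W0 + scale * dW."""
    return [w + lora_scale * d for w, d in zip(w_base, w_delta)]


# Toy two-dimensional "noise predictions" (hypothetical values).
eps_uncond = [0.0, 1.0]
eps_text = [1.0, 1.0]
eps_style = [1.5, 0.0]

guided = style_guided_noise(eps_uncond, eps_text, eps_style)
plain_cfg = style_guided_noise(eps_uncond, eps_text, eps_style, style_scale=0.0)
print(guided, plain_cfg)
```

Under this assumed form, setting `style_scale` to zero recovers ordinary classifier-free guidance on the base model, so style strength is adjusted in prediction space without changing the model weights. LoRA weight scaling, by contrast, changes the weights that produce every prediction.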
