StyleForge: Enhancing Text-to-Image Synthesis for Any Artistic Styles with Dual Binding

Junseo Park, Beomseok Ko, Hyeryung Jang

TL;DR

StyleForge introduces dual-binding personalization for text-to-image synthesis, enabling high-fidelity rendering of arbitrary artistic styles. Single-StyleForge binds a target style to a dedicated token while using auxiliary images to bind crucial elements such as people, and Multi-StyleForge adds a second token to separate background and subject characteristics, improving text–image alignment. Across six styles, the methods achieve better FID/KID scores for image quality and higher CLIP scores for prompt fidelity than baseline personalization techniques such as DreamBooth and Textual Inversion. The approach reduces data requirements and improves generalization, enabling practical style-consistent generation and providing a flexible framework for style-driven image synthesis.

Abstract

Recent advancements in text-to-image models, such as Stable Diffusion, have showcased their ability to create visual images from natural language prompts. However, existing methods like DreamBooth struggle with capturing arbitrary art styles due to the abstract and multifaceted nature of stylistic attributes. We introduce Single-StyleForge, a novel approach for personalized text-to-image synthesis across diverse artistic styles. Using approximately 15 to 20 images of the target style, Single-StyleForge establishes a foundational binding of a unique token identifier with a broad range of attributes of the target style. Additionally, auxiliary images are incorporated for dual binding that guides the consistent representation of crucial elements such as people within the target style. Furthermore, we present Multi-StyleForge, which enhances image quality and text alignment by binding multiple tokens to partial style attributes. Experimental evaluations across six distinct artistic styles demonstrate significant improvements in image quality and perceptual fidelity, as measured by FID, KID, and CLIP scores.
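The dual binding described above can be pictured as two groups of image/prompt pairs that are fine-tuned together. The following is a minimal sketch of how such pairs might be assembled, assuming the two prompts quoted in the Figure 2 caption; the `TrainingPair` type, the `build_dual_binding_pairs` helper, and the file names are hypothetical illustrations, not the authors' training code.

```python
from dataclasses import dataclass
from typing import List

# Prompts as quoted in the Figure 2 caption.
STYLE_REF_PROMPT = "a photo of [V] style"  # paired with StyleRef images
AUX_PROMPT = "a photo of style"            # paired with Aux images


@dataclass(frozen=True)
class TrainingPair:
    """One (image, prompt) example for fine-tuning (hypothetical type)."""
    image_path: str
    prompt: str


def build_dual_binding_pairs(
    style_ref_paths: List[str], aux_paths: List[str]
) -> List[TrainingPair]:
    """Pair StyleRef images with the "[V] style" prompt and Aux images
    with the plain "style" prompt (hypothetical helper)."""
    pairs = [TrainingPair(p, STYLE_REF_PROMPT) for p in style_ref_paths]
    pairs += [TrainingPair(p, AUX_PROMPT) for p in aux_paths]
    return pairs


if __name__ == "__main__":
    # Hypothetical file names; the abstract suggests roughly 15 to 20
    # StyleRef images of the target style.
    style_refs = [f"style_ref_{i:02d}.png" for i in range(15)]
    aux = [f"aux_{i:02d}.png" for i in range(5)]
    for pair in build_dual_binding_pairs(style_refs, aux)[:3]:
        print(pair.image_path, "->", pair.prompt)
```

The sketch covers only the data pairing. How the two groups are weighted in the loss during fine-tuning is not specified in this section and is left out.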

Paper Structure

This paper contains 20 sections, 4 equations, 12 figures, 4 tables, 2 algorithms.

Figures (12)

  • Figure 1: Text-to-image synthesis via Single/Multi-StyleForge personalized in various art styles, from realism to pixel-art. The generated images demonstrate our approach's ability to create aligned and high-fidelity images in each target style (top row) by using a unique token ("[V] style") in text prompts.
  • Figure 2: The architecture of Single-StyleForge. StyleRef images of the target style, paired with the text prompt ("a photo of [V] style"), and Aux images, paired with the prompt ("a photo of style"), are provided as inputs. After fine-tuning, the text-to-image model can generate varied images in the target style under the guidance of text prompts.
  • Figure 3: Comparison of different compositions of StyleRef images. In (a) and (b), the StyleRef images consist only of backgrounds and only of people, respectively; the target style is then learned from biased information, and the output in (a) fails to include a girl. The generated images in (c) closely align with the prompts.
  • Figure 4: Attention maps for the "[V]" and "style" tokens in the prompt. As designed, "[V]" attends to nearly the whole image, while "style" attends to people. The maps were produced with a modified version of Prompt-to-Prompt.
  • Figure 5: Ablation study of Aux images $\mathbf{x}^{\text{aux}}$ for six target styles, displaying FID, KID ($\times 10^3$), and CLIP scores.
  • ...and 7 more figures