StyleForge: Enhancing Text-to-Image Synthesis for Any Artistic Styles with Dual Binding
Junseo Park, Beomseok Ko, Hyeryung Jang
TL;DR
StyleForge introduces dual-binding personalization for text-to-image synthesis, enabling high-fidelity rendering of arbitrary artistic styles. Single-StyleForge binds a target style to a dedicated token while using auxiliary images to bind crucial elements such as people. Multi-StyleForge adds a second token to separate background and subject characteristics, which improves text-image alignment. Across six styles, the methods achieve better FID/KID scores for image quality and higher CLIP scores for prompt fidelity than baseline personalization techniques such as DreamBooth and Textual Inversion. The approach reduces data requirements and improves generalization, offering a practical and flexible framework for style-consistent image generation.
Abstract
Recent advancements in text-to-image models, such as Stable Diffusion, have showcased their ability to generate images from natural language prompts. However, existing methods like DreamBooth struggle to capture arbitrary artistic styles because stylistic attributes are abstract and multifaceted. We introduce Single-StyleForge, a novel approach for personalized text-to-image synthesis across diverse artistic styles. Using approximately 15 to 20 images of the target style, Single-StyleForge establishes a foundational binding between a unique token identifier and a broad range of attributes of the target style. Additionally, auxiliary images are incorporated for dual binding, which guides the consistent representation of crucial elements such as people within the target style. Furthermore, we present Multi-StyleForge, which enhances image quality and text alignment by binding multiple tokens to partial style attributes. Experimental evaluations across six distinct artistic styles demonstrate significant improvements in image quality and perceptual fidelity, as measured by FID, KID, and CLIP scores.
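As an illustration of the dual-binding idea described above, the following minimal sketch shows one way the two image sets might be paired with prompts before fine-tuning. It is not the authors' implementation: the function name `build_dual_binding_pairs`, the `TrainingPair` type, the token string `[V]`, the class word `style`, and the prompt templates are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingPair:
    """One (image, prompt) example for fine-tuning (hypothetical type)."""
    image_path: str
    prompt: str
    role: str  # "style" for target-style images, "auxiliary" otherwise


def build_dual_binding_pairs(
    style_images: List[str],
    aux_images: List[str],
    style_token: str = "[V]",   # assumed unique token identifier
    class_word: str = "style",  # assumed class word shared by both prompts
) -> List[TrainingPair]:
    """Pair each image set with its own prompt (assumed templates).

    Target-style images get a prompt containing the unique token, so the
    token is bound to the style. Auxiliary images get the same prompt
    without the token, so they bind to the plain class word instead.
    """
    if not style_images:
        raise ValueError("at least one target-style image is required")

    style_prompt = f"a photo of {style_token} {class_word}"
    aux_prompt = f"a photo of {class_word}"

    pairs = [TrainingPair(p, style_prompt, "style") for p in style_images]
    pairs += [TrainingPair(p, aux_prompt, "auxiliary") for p in aux_images]
    return pairs
```

In this sketch the abstract's figure of roughly 15 to 20 target-style images would correspond to the length of `style_images`; how the two sets are weighted during training is left out because the abstract does not specify it.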
