Image Generation Based on Image Style Extraction
Shuochen Chang
TL;DR
This work tackles fine-grained style control in text-to-image generation by decoupling style from content through a three-stage training pipeline built around a style encoder and an 8-token style vector. It introduces Style30k-captions, a large dataset pairing images with content-focused captions and style labels, to train a style extraction module that aligns visual style with text embeddings. The pipeline comprises Stage 1, per-image style inversion; Stage 2, feed-forward pre-training of a CLIP-based style encoder and a projection layer; and Stage 3, end-to-end joint fine-tuning. Together these enable style-conditioned generation from a single reference image. Experiments show that the Stage 3 module achieves the highest style fidelity, while the more efficient Stage 2 module approaches Stage 1 performance, demonstrating strong potential for personalized and instance-guided stylization without altering the underlying diffusion backbone.
Abstract
Text-to-image generation has many practical applications, but fine-grained styles are hard to describe and control precisely in natural language, and the guidance carried by a stylized reference image is difficult to align directly with the text conditions used in conventional text-guided generation. This study asks how to make full use of the generative capability of a pretrained model: it extracts a fine-grained style representation from a single reference image and injects that representation into the generator without changing the architecture of the downstream generative model, so as to achieve finely controlled stylized image generation. We propose an image generation method based on style extraction with three-stage training, which uses a style encoder and a style projection layer to align style representations with text representations, enabling fine-grained style-guided generation driven by text prompts. In addition, we construct the Style30k-captions dataset, whose samples are triplets of an image, a style label, and a text description, and use it to train the style encoder and the style projection layer.
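To make the alignment idea concrete, the following is a minimal sketch of how a style projection layer might map one image embedding to a fixed number of style tokens that live in the same space as text token embeddings. It is not the authors' implementation. The 8-token count comes from the TL;DR; everything else is an assumption made for illustration: the function names, the toy dimensions, the single linear map, the random vector standing in for a CLIP image embedding, and the choice to append the style tokens after the text tokens (the source does not say how the tokens are injected).

```python
import random

NUM_STYLE_TOKENS = 8  # style vector length stated in the TL;DR


def make_projection(image_dim, text_dim, num_tokens, seed=0):
    """Hypothetical projection weights: one linear map from the image
    embedding to num_tokens * text_dim outputs (random, untrained)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 0.02) for _ in range(image_dim)]
        for _ in range(num_tokens * text_dim)
    ]


def project_style(image_embedding, weights, num_tokens, text_dim):
    """Apply the linear map, then split the result into num_tokens
    vectors, each with the width of a text token embedding."""
    flat = [
        sum(w * x for w, x in zip(row, image_embedding))
        for row in weights
    ]
    return [
        flat[i * text_dim:(i + 1) * text_dim]
        for i in range(num_tokens)
    ]


def build_condition(text_tokens, style_tokens):
    """Assumed injection scheme: append style tokens to the text token
    sequence so the generator's architecture is left unchanged."""
    return list(text_tokens) + list(style_tokens)


# Toy sizes (assumptions); real encoders use far larger dimensions.
IMAGE_DIM, TEXT_DIM, NUM_TEXT_TOKENS = 16, 12, 5

rng = random.Random(1)
# Stand-in for the style encoder's output on a reference image.
image_embedding = [rng.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)]
# Stand-in for the text encoder's token embeddings of a content prompt.
text_tokens = [[0.0] * TEXT_DIM for _ in range(NUM_TEXT_TOKENS)]

weights = make_projection(IMAGE_DIM, TEXT_DIM, NUM_STYLE_TOKENS)
style_tokens = project_style(image_embedding, weights, NUM_STYLE_TOKENS, TEXT_DIM)
condition = build_condition(text_tokens, style_tokens)
```

Because the style tokens share the width of the text token embeddings, the combined sequence can be passed to the generator through its existing text-conditioning input, which is one way to read the claim that the downstream model's structure is left unchanged.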
