Image Generation Based on Image Style Extraction

Shuochen Chang

TL;DR

This work tackles fine-grained style control in text-to-image generation by decoupling style from content through a three-stage training pipeline built around a style encoder and an 8-token style vector. It introduces Style30k-captions, a dataset pairing images with content-focused captions and style labels, to train a style extraction module that aligns visual style with text embeddings. The pipeline comprises Stage 1 (per-image style inversion), Stage 2 (feed-forward pre-training of a CLIP-based style encoder and a projection layer), and Stage 3 (end-to-end joint fine-tuning), and it enables style-conditioned generation from a single reference image. Experiments show that the Stage 3 module achieves the highest style fidelity, while the more efficient Stage 2 module approaches Stage 1 performance, indicating potential for personalized and instance-guided stylization without altering the underlying diffusion backbone.

Abstract

Text-to-image generation models have many practical applications, but fine-grained styles cannot be precisely described or controlled in natural language, and the guidance carried by a stylized reference image is difficult to align directly with the textual conditions used in conventional text-guided generation. This study focuses on making full use of the generative capability of a pretrained generative model: it extracts a fine-grained style representation from a single reference image and injects that representation into the generator without changing the architecture of the downstream generative model, so as to achieve fine-grained, controllable stylized image generation. We propose an image generation method based on style extraction with three-stage training, which uses a style encoder and a style projection layer to align style representations with text representations and thereby enables fine-grained, text-prompt-based style-guided generation. In addition, we construct the Style30k-captions dataset, whose samples are triplets of image, style label, and text description, to train the style encoder and the style projection layer.
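
As a rough illustration of the idea described above, the following sketch maps one style feature vector to 8 style tokens with a linear projection and appends them to a sequence of text-token embeddings. Only the count of 8 style tokens comes from the summary; everything else is an assumption made for illustration: the toy dimensions, the random weights, the use of a single linear layer, the choice of concatenation as the injection mechanism, and the hypothetical helper names `make_projection`, `project_style`, and `build_condition`. It is not the authors' implementation.

```python
import random

NUM_STYLE_TOKENS = 8   # from the summary: an 8-token style vector
FEATURE_DIM = 16       # assumed toy size of the style encoder output
TOKEN_DIM = 12         # assumed toy width of one text-token embedding


def make_projection(feature_dim, token_dim, num_tokens, seed=0):
    """Random weights for a linear map: feature_dim -> num_tokens * token_dim.

    Stands in for a trained projection layer (assumption: a single linear layer).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(feature_dim)]
            for _ in range(num_tokens * token_dim)]


def project_style(feature, weights, num_tokens, token_dim):
    """Apply the linear map, then split the flat output into style tokens."""
    flat = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    return [flat[i * token_dim:(i + 1) * token_dim] for i in range(num_tokens)]


def build_condition(text_tokens, style_tokens):
    """Append style tokens to the text tokens along the sequence axis.

    Assumption: concatenation is one simple way to inject style into the
    text condition without modifying the generator itself.
    """
    return text_tokens + style_tokens


rng = random.Random(1)
# Placeholder for a style encoder output on one reference image.
style_feature = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
# Placeholder for the embeddings of a 5-token content prompt.
text_tokens = [[rng.gauss(0.0, 1.0) for _ in range(TOKEN_DIM)] for _ in range(5)]

weights = make_projection(FEATURE_DIM, TOKEN_DIM, NUM_STYLE_TOKENS)
style_tokens = project_style(style_feature, weights, NUM_STYLE_TOKENS, TOKEN_DIM)
condition = build_condition(text_tokens, style_tokens)
print(len(condition), len(condition[0]))
```

Because the style tokens have the same width as the text tokens, the combined sequence can be passed wherever the text condition was expected, which is what allows the downstream generator to stay unchanged in this sketch.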

Paper Structure

This paper contains 31 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Visual evaluation of reconstruction quality at different training steps in Stage 1. Each row shows one example. The images within each row, from left to right, are: Original Image, Reconstruction at Step 50, Step 100, Step 150, and Step 200. In most cases, the Step 200 reconstruction is the most faithful.
  • Figure 2: t-SNE visualization of features extracted by different encoders. Each color represents a different style category. Our pre-trained style encoder (a) demonstrates superior clustering of style categories compared to both the baseline CLIP (b) and VGG19 (c) models.
  • Figure 3: Visual comparison of original images and their reconstructions using the Stage 2 module. In each example, the left image is the original and the right is the reconstruction. The results show high fidelity.
  • Figure 4: Visual comparison of original images and their reconstructions using the fully fine-tuned Stage 3 module. In each example, the left image is the original and the right is the reconstruction. Stage 3 shows a noticeable improvement in capturing fine-grained stylistic details over Stage 2.
  • Figure 5: Reconstruction results from the ablation study without the projection layer. In each example, the left image is the original and the right is the reconstruction. The model fails to generate semantically coherent images.