Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics

Jinghao Hu, Yuhe Zhang, GuoHua Geng, Liuyuxin Yang, JiaRui Yan, Jingtao Cheng, YaDong Zhang, Kang Li

TL;DR

This study proposes a zero-shot scheme for image variation with coordinated semantics. It yields highly plausible results, particularly when generating stylized images with high-fidelity semantics.

Abstract

Traditionally, style has been primarily considered in terms of artistic elements such as colors, brushstrokes, and lighting. However, identical semantic subjects, like people, boats, and houses, can vary significantly across different artistic traditions, indicating that style also encompasses the underlying semantics. Therefore, in this study, we propose a zero-shot scheme for image variation with coordinated semantics. Specifically, our scheme transforms the image-to-image problem into an image-to-text-to-image problem. The image-to-text operation employs vision-language models (e.g., BLIP) to generate text describing the content of the input image, including the objects and their positions. Subsequently, the input style keyword is elaborated into a detailed description of this style and then merged with the content text using the reasoning capabilities of ChatGPT. Finally, the text-to-image operation utilizes a diffusion model to generate images based on the text prompt. To enable the diffusion model to accommodate more styles, we propose a fine-tuning strategy that injects text and style constraints into cross-attention. This ensures that the output image exhibits similar semantics in the desired style. To validate the performance of the proposed scheme, we constructed a benchmark comprising images of various styles and scenes and introduced two novel metrics. Despite its simplicity, our scheme yields highly plausible results in a zero-shot manner, particularly for generating stylized images with high-fidelity semantics.
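The image-to-text-to-image flow described in the abstract can be sketched as a chain of four stages. This is only a minimal illustration, not the authors' implementation: the function names (`describe_content`, `image_variation`), the callable parameters (`caption_fn`, `vqa_fn`, `llm_fn`, `t2i_fn`), and the default question are all hypothetical placeholders. They stand in for a captioning model such as BLIP, a visual question answering model, a language model such as ChatGPT, and a diffusion model.

```python
from typing import Any, Callable, Sequence


def describe_content(
    image: Any,
    caption_fn: Callable[[Any], str],
    vqa_fn: Callable[[Any, str], str],
    questions: Sequence[str],
) -> str:
    """Image-to-text: a caption plus answers about object positions."""
    caption = caption_fn(image)
    answers = [vqa_fn(image, q) for q in questions]
    return ". ".join([caption, *answers])


def image_variation(
    image: Any,
    style_keyword: str,
    caption_fn: Callable[[Any], str],
    vqa_fn: Callable[[Any, str], str],
    llm_fn: Callable[[str, str], str],
    t2i_fn: Callable[[str], Any],
    questions: Sequence[str] = ("Where is the main object?",),  # hypothetical
) -> Any:
    """Image -> text -> image, with the style merged in at the text stage."""
    content_text = describe_content(image, caption_fn, vqa_fn, questions)
    # The language model elaborates the style keyword and merges it
    # with the content description into one text prompt.
    prompt = llm_fn(content_text, style_keyword)
    # Text-to-image: the generator sees only the merged prompt.
    return t2i_fn(prompt)
```

Because every stage is passed in as a callable, each model can be swapped or stubbed independently. The source image influences the result only through the text description.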

Paper Structure

This paper contains 22 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Given an RGB image and a style keyword, our image-to-text-to-image scheme generates the image variations in the target style with coordinated semantics. These results look like images in different worlds.
  • Figure 2: Existing style transfer methods prioritize retaining content while adjusting color and brushstrokes, often resulting in images that lack authenticity in style. Furthermore, it is challenging to achieve satisfactory transfer results when an image with an applied style is used as the input for a continuous transfer task.
  • Figure 3: Our image-to-text-to-image scheme: The source image is first input into the image-to-text module, which consists of BLIP and BLIP-VQA, to obtain the image content with the location of objects. Style keywords are then integrated with this content using ChatGPT to create a text prompt for the text-to-image module. Finally, the text-to-image module, which incorporates a latent diffusion model, generates the image with the same content in the desired style.
  • Figure 4: Flowchart of the conditional constraints. Features from the CLIP or Swin style encoder are encoded into a feature sequence, which is fed into the cross-attention layer of the CrossAttnDownBlock, where attention is computed.
  • Figure 5: The comparison of visual results with multi-conditional image generation methods. Our approach more accurately preserves semantics and generates images with distinctly desired styles.
  • ...and 1 more figures
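
The Figure 4 caption describes condition features being fed into a cross-attention layer. The toy sketch below shows one way such conditioning can work, under strong assumptions that do not come from the paper: a single attention head, identity projections in place of learned query/key/value weights, and text and style tokens simply concatenated along the sequence axis. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Each query attends over the context tokens (identity projections)."""
    d = len(context[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        # Output is the attention-weighted average of the context tokens.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)]
        )
    return outputs


def conditioned_attention(
    latent_tokens: List[Vector],
    text_tokens: List[Vector],
    style_tokens: List[Vector],
) -> List[Vector]:
    """Assumption: both conditions are joined into one context sequence."""
    return cross_attention(latent_tokens, text_tokens + style_tokens)
```

In this simplified form, every latent token becomes a weighted mix of text and style features, so both constraints can shape the same spatial location. A real layer would add learned projections and multiple heads.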