A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu, Guoliang Kang

TL;DR

The paper tackles the challenge of generating novel, visually consistent styles without reference images or lengthy prompts by introducing code-to-style generation. It presents CoTyle, the first open-source framework that learns a discrete style codebook and an autoregressive style generator to condition a diffusion-based text-to-image model on style embeddings, enabling style synthesis from numerical codes. Through extensive experiments, CoTyle demonstrates high style consistency and competitive creativity, while also supporting image-conditioned generation and style interpolation. The work offers a reproducible, portable approach to open-ended style design and paves the way for further research on discrete stylistic representations across modalities.

Abstract

Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing a novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating that a style is worth one code.
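
The inference step described above (and in Figure 3(c), where a style code is used to sample the first index and the rest are predicted autoregressively) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the style code acts as a random seed, and the names `make_toy_model`, `style_code_to_indices`, and `indices_to_embedding`, the codebook size, the sequence length, and the stand-in next-index distribution are all hypothetical. A real system would replace the toy model with the trained autoregressive style generator and pass the resulting embedding to the T2I-DM.

```python
import random

VOCAB_SIZE = 16   # hypothetical number of codebook entries
SEQ_LEN = 8       # hypothetical number of style tokens per style
EMBED_DIM = 4     # hypothetical embedding width


def make_toy_model(vocab_size):
    """Return a stand-in for a trained autoregressive style generator.

    The returned function maps a prefix of indices to unnormalised weights
    over the next index. Here it simply favours indices close to the last one.
    """
    def next_index_weights(prefix):
        last = prefix[-1]
        weights = []
        for i in range(vocab_size):
            gap = abs(i - last)
            circular_gap = min(gap, vocab_size - gap)
            weights.append(1.0 / (1 + circular_gap))
        return weights
    return next_index_weights


def style_code_to_indices(style_code, seq_len, vocab_size, next_index_weights):
    """Map an integer style code to a reproducible sequence of codebook indices."""
    rng = random.Random(style_code)          # assumption: the code is used as a seed
    indices = [rng.randrange(vocab_size)]    # sample the first index at random
    while len(indices) < seq_len:            # predict the rest autoregressively
        weights = next_index_weights(indices)
        indices.append(rng.choices(range(vocab_size), weights=weights, k=1)[0])
    return indices


def indices_to_embedding(indices, codebook):
    """Look up each index in the discrete style codebook."""
    return [codebook[i] for i in indices]


# A toy codebook with fixed random vectors (a trained codebook in practice).
_cb_rng = random.Random(0)
codebook = [[_cb_rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
            for _ in range(VOCAB_SIZE)]

model = make_toy_model(VOCAB_SIZE)
indices = style_code_to_indices(1234, SEQ_LEN, VOCAB_SIZE, model)
style_embedding = indices_to_embedding(indices, codebook)
print(indices)
```

Because all randomness in the sketch flows from a generator seeded by the style code, the same code always yields the same index sequence and therefore the same style embedding, which is the reproducibility property the abstract emphasizes.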

Paper Structure

This paper contains 21 sections, 5 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: Visual demonstration of CoTyle. The figure shows five groups of samples (top-left, top-right, bottom-left, bottom-right, center), each generated from a distinct style code, with consistent style within each group and distinct styles across groups. Our homepage can be found at https://kwai-kolors.github.io/CoTyle/.
  • Figure 2: Unlike previous methods, CoTyle uses a numerical style code to represent a style, eliminating the need for complex prompts, images, or LoRAs, and allowing easy creation of unique styles just by modifying the code. "Creativity", "Consistency", and "Reproducibility" refer to a model’s ability to (1) generate novel styles, (2) produce multiple images in the same style consistently, and (3) reproduce styles using simple, user-friendly style definitions.
  • Figure 3: Overview of CoTyle. (a) We first train a style codebook and an image generation model conditioned on style images. (b) Then, we use the corresponding codebook indices of the style images to train an autoregressive style generator. (c) During inference, a style code is used to randomly sample the first index and autoregressively predict the rest.
  • Figure 4: Qualitative comparison with Midjourney on code-to-style generation. Each image set (2×3 grid) is generated from the same style code. Red boxes highlight cases with suboptimal style consistency.
  • Figure 5: We compare injecting style through the textual branch with the existing method of injecting it through the visual branch. Injecting style through the textual branch better preserves semantic information.
  • ...and 9 more figures