One-Shot Structure-Aware Stylized Image Synthesis

Hansam Cho; Jonghyun Lee; Seunggyu Chang; Yonghyun Jeong

TL;DR

OSASIS addresses one-shot stylization by explicitly separating structure from semantics within a diffusion-based framework. It uses a structural latent code $\mathbf{x}_{t_0}$ and a semantic latent code $\mathbf{z}_{\mathrm{sem}}$, together with a structure-preserving network and CLIP directional losses that bridge the input and style domains. This design preserves structure robustly even with out-of-domain references and supports text-driven manipulation. Across multiple datasets, the approach shows better structure fidelity and style transfer than the baselines, handles rarely seen input structures, and supports real-time content/style mixing. These gains in robustness and controllability come at the cost of longer training times and per-style training; future work targets efficiency and generalization across styles.

Abstract

While GAN-based models have been successful in image stylization tasks, they often struggle with structure preservation while stylizing a wide range of input images. Recently, diffusion models have been adopted for image stylization but still lack the capability to maintain the original quality of input images. Building on this, we propose OSASIS: a novel one-shot stylization method that is robust in structure preservation. We show that OSASIS is able to effectively disentangle the semantics from the structure of an image, allowing it to control the level of content and style applied to a given input. We apply OSASIS to various experimental settings, including stylization with out-of-domain reference images and stylization with text-driven manipulation. Results show that OSASIS outperforms other stylization methods, especially for input images that were rarely encountered during training, providing a promising solution to stylization via diffusion models.

Paper Structure (42 sections, 14 equations, 14 figures, 4 tables)

Introduction
Background
Diffusion Models
Diffusion Autoencoders
Methods
Training
Structural Latent Code
Structure-Preserving Network
Loss Function
Sampling
Mixing Content and Style
Text-driven Manipulation
Experiments
Qualitative Comparison
Quantitative Comparison
...and 27 more sections

Figures (14)

Figure 1: Overview of OSASIS. During finetuning, the cross-domain loss compares the photorealistic image (outlined in yellow) with its stylized counterparts (outlined in green). Concurrently, the in-domain loss measures how well the directional shifts within each domain (yellow and green) align. The reconstruction loss compares the original style image with its reconstruction. Intuitively, the two directional losses together position the generated $I^{\mathrm{in}}_B$ at each iteration so that the projection vectors from $I^{\mathrm{style}}_B$ and $I^{\mathrm{in}}_A$ are collinear with their cross-domain and in-domain counterparts in the CLIP space.
Figure 2: High- and low-density images. Full Recon. refers to reconstruction conditioned on the encoded $\mathbf{z}_{\mathrm{sem}}$ and the encoded $\mathbf{x}_T$, whereas Stochastic Recon. refers to reconstruction conditioned on the encoded $\mathbf{z}_{\mathrm{sem}}$ and a randomly sampled $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
Figure 3: Comparison with other stylization methods. Note that our method successfully preserves the low-density attributes while other baseline methods fail to do so.
Figure 4: Stylization with OOD reference images. Due to the limited capabilities of GAN-based inversion methods, the baseline methods fail to disentangle the structure and semantics of the style image. As a result, structural artifacts are transferred into the output image, whereas OSASIS successfully extracts only the semantics.
Figure 5: Stylization result of OSASIS on LSUN-church, AFHQ-dog, and DeepFashion.
...and 9 more figures
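
The two directional losses described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes each loss takes the common form of one minus the cosine similarity between two embedding-space direction vectors, it uses plain Python lists as stand-ins for CLIP image embeddings, and the function names and the symbol `e_style_A` (the photorealistic counterpart of the style image) are hypothetical names chosen for illustration.

```python
import math

EPS = 1e-8  # guards against division by zero for degenerate directions


def sub(u, v):
    """Element-wise difference u - v of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + EPS)


def directional_loss(src_1, dst_1, src_2, dst_2):
    """1 - cos between the shifts (dst_1 - src_1) and (dst_2 - src_2)."""
    return 1.0 - cosine(sub(dst_1, src_1), sub(dst_2, src_2))


def cross_domain_loss(e_in_A, e_in_B, e_style_A, e_style_B):
    # Assumed form: the A -> B shift of the input should be parallel
    # to the A -> B shift of the style pair.
    return directional_loss(e_in_A, e_in_B, e_style_A, e_style_B)


def in_domain_loss(e_in_A, e_in_B, e_style_A, e_style_B):
    # Assumed form: the input -> style shift inside domain A should be
    # parallel to the input -> style shift inside domain B.
    return directional_loss(e_in_A, e_style_A, e_in_B, e_style_B)


# Toy 2-D "embeddings" (hypothetical values, not real CLIP features).
e_in_A = [0.0, 0.0]
e_style_A = [1.0, 0.0]
e_style_B = [1.0, 1.0]
good_in_B = [0.0, 1.0]  # completes the parallelogram
bad_in_B = [1.0, 0.0]   # breaks both alignments

loss_good = (cross_domain_loss(e_in_A, good_in_B, e_style_A, e_style_B)
             + in_domain_loss(e_in_A, good_in_B, e_style_A, e_style_B))
loss_bad = (cross_domain_loss(e_in_A, bad_in_B, e_style_A, e_style_B)
            + in_domain_loss(e_in_A, bad_in_B, e_style_A, e_style_B))
```

In this toy setup, both losses are near zero when the four embeddings form a parallelogram, which is one way to read the collinearity condition in the caption. In practice the vectors would come from a CLIP image encoder, and the relative weighting of the two terms is not specified here.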
