NSYNC: Negative Synthetic Image Generation for Contrastive Training to Improve Stylized Text-To-Image Translation

Serkan Ozturk, Samet Hicsonmez, Pinar Duygulu

TL;DR

NSYNC tackles fine-grained stylized text-to-image translation by introducing negative synthetic data within a contrastive learning framework. It curates a negative set via a latent diffusion model with generic prompts and refines gradient updates through an orthogonal projection mechanism applied to gradients from positive, negative, and anchor samples, leveraging Textual Inversion as the adaptation backbone. The approach yields consistent improvements across Monet, Van Gogh, Studio Ghibli, and illustrators, as measured by CMMD, CSD, KID, and FID, while maintaining reasonable training costs. This work demonstrates that negative data and gradient refinement can substantially enhance stylistic fidelity without relying on real negative exemplars, and it remains compatible with TI and LoRA paradigms for broader applicability.

Abstract

Current text-conditioned image generation methods output realistic-looking images, but they fail to capture specific styles. Simply finetuning them on target style datasets still fails to capture the style features. In this work, we present a novel contrastive learning framework to improve the stylization capability of large text-to-image diffusion models. Motivated by the rapid advances in image generation models, which have made synthetic data an intrinsic part of model training in various computer vision tasks, we exploit synthetic image generation in our approach. Usually, the generated synthetic data is task-dependent, and most of the time it is used to enlarge the available real training dataset. With NSYNC, in contrast, we focus on generating negative synthetic sets to be used in a novel contrastive training scheme along with real positive images. In our proposed training setup, we forward negative data along with positive data and obtain negative and positive gradients, respectively. We then refine the positive gradient by subtracting its projection onto the negative gradient to get the orthogonal component, based on which the parameters are updated. This orthogonal component eliminates the trivial attributes that are present in both positive and negative data and directs the model toward capturing what is unique to the target style. Experiments on various styles of painters and illustrators show that our approach improves performance over the baseline methods both quantitatively and qualitatively. Our code is available at https://github.com/giddyyupp/NSYNC.
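
The gradient refinement described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it only shows the vector operation of subtracting from a positive gradient its projection onto a negative gradient. The function names, the `eps` guard, the toy gradient values, and the plain gradient-descent step with learning rate `lr` are all assumptions made for illustration, and the anchor gradient used in the full method is omitted.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_component(
    grad_pos: Sequence[float],
    grad_neg: Sequence[float],
    eps: float = 1e-12,  # assumed guard against a near-zero negative gradient
) -> List[float]:
    """Return grad_pos minus its projection onto grad_neg.

    The result is orthogonal to grad_neg, so the direction shared with
    the negative gradient is removed from the positive gradient.
    """
    neg_sq_norm = dot(grad_neg, grad_neg)
    if neg_sq_norm < eps:
        # Nothing to project onto: keep the positive gradient unchanged.
        return list(grad_pos)
    scale = dot(grad_pos, grad_neg) / neg_sq_norm
    return [p - scale * n for p, n in zip(grad_pos, grad_neg)]


# Toy example (hypothetical values): the first coordinate stands for a
# direction shared by positive and negative data, the second for a
# direction that only the positive (style) data pushes along.
grad_pos = [3.0, 4.0]
grad_neg = [1.0, 0.0]
refined = orthogonal_component(grad_pos, grad_neg)
print(refined)  # [0.0, 4.0]

# Assumed plain gradient-descent step on a toy style embedding.
lr = 0.1
embedding = [0.5, 0.5]
embedding = [v - lr * g for v, g in zip(embedding, refined)]
```

In this toy case the shared first coordinate is cancelled entirely, so the update moves the embedding only along the direction that distinguishes the positive data from the negative data.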

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Text-to-image generation results of an off-the-shelf Latent Diffusion Model (LDM), finetuned LoRA and Textual Inversion (TI) models, and our approach, NSYNC, on paintings of Monet (top) and Van Gogh (bottom). LDM clearly generates a generic image without the target style for the given input text. Although TI improves over LDM, it still fails to capture style elements. In contrast, our method NSYNC captures the target style and generates images that are visually similar to the real paintings. For each method, we append the text "in the style of $*$" to the input captions. The ground truth (GT) images are the paintings from which the input text prompts originate. Zoom in for details.
  • Figure 2: NSYNC processing pipeline. We start by curating a negative dataset. Next, we finetune the baseline adapter model using our novel contrastive learning formulation. Finally, the inference stage is similar to that of the baseline text-to-image diffusion model.
  • Figure 3: Contrastive training framework of NSYNC. We only train the embedding of the special style token in the Text Encoder, as in Textual Inversion. However, instead of directly updating the weights ($v_{*}$) of the newly added token $S^*$ using only a single positive image, we calculate two projections on the gradients of positive (${\nabla}_{pos}$), negative (${\nabla}_{neg}$), and anchor (${\nabla}_{anchor}$) images. We use these projections to find better gradient directions, and update the weight $v_{*}$ with the refined gradient ${\nabla}_{*}$.
  • Figure 4: Comparison of NSYNC with baseline methods using SD v1.x backbones. Ground Truth (GT) is given at the bottom, and the textual description used to generate the images is given at the top. Note that ground truth (GT) images are for illustration purposes. Zoom in for details.
  • Figure 5: Visual results of NSYNC on all the datasets. Each row corresponds to one dataset. Ground Truth (GT) is given on the right. Note that ground truth (GT) images are for illustration purposes. Zoom in for details.