NSYNC: Negative Synthetic Image Generation for Contrastive Training to Improve Stylized Text-To-Image Translation
Serkan Ozturk, Samet Hicsonmez, Pinar Duygulu
TL;DR
NSYNC tackles fine-grained stylized text-to-image translation by introducing negative synthetic data within a contrastive learning framework. It curates a negative set via a latent diffusion model with generic prompts and refines gradient updates through an orthogonal projection mechanism applied to gradients from positive, negative, and anchor samples, using Textual Inversion (TI) as the adaptation backbone. The approach yields consistent improvements across the Monet, Van Gogh, Studio Ghibli, and illustrator styles, as measured by CMMD, CSD, KID, and FID, while maintaining reasonable training costs. This work demonstrates that negative data and gradient refinement can substantially enhance stylistic fidelity without relying on real negative exemplars, and it remains compatible with the TI and LoRA paradigms for broader applicability.
Abstract
Current text-conditioned image generation methods output realistic-looking images, but they fail to capture specific styles. Even when simply fine-tuned on target-style datasets, they still struggle to grasp the style features. In this work, we present a novel contrastive learning framework to improve the stylization capability of large text-to-image diffusion models. Motivated by the astonishing advances in image generation models, which have made synthetic data an intrinsic part of model training in various computer vision tasks, we exploit synthetic image generation in our approach. Generated synthetic data is usually task dependent, and it is most often used to enlarge the available real training dataset. With NSYNC, we instead focus on generating negative synthetic sets to be used in a novel contrastive training scheme along with real positive images. In our proposed training setup, we forward negative data along with positive data and obtain negative and positive gradients, respectively. We then refine the positive gradient by subtracting its projection onto the negative gradient to obtain the orthogonal component, based on which the parameters are updated. This orthogonal component eliminates the trivial attributes that are present in both positive and negative data and directs the model towards capturing a more unique style. Experiments on various styles of painters and illustrators show that our approach improves performance over the baseline methods both quantitatively and qualitatively. Our code is available at https://github.com/giddyyupp/NSYNC.
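The gradient refinement described above can be illustrated with a minimal sketch. It assumes the positive and negative gradients have been flattened into plain vectors g_pos and g_neg, and computes g_pos - (<g_pos, g_neg> / ||g_neg||^2) * g_neg. The function name orthogonal_component, the eps guard for a near-zero negative gradient, and the choice to project over one flattened vector are assumptions made for illustration, not details taken from the released code.

```python
def orthogonal_component(g_pos, g_neg, eps=1e-12):
    """Return g_pos minus its projection onto g_neg.

    g_pos, g_neg: equal-length sequences of floats (flattened gradients).
    eps: hypothetical guard; if g_neg is (near) zero there is nothing to
         project onto, so g_pos is returned unchanged.
    """
    # Squared norm of the negative gradient.
    denom = sum(n * n for n in g_neg)
    if denom < eps:
        return list(g_pos)
    # Scalar projection coefficient <g_pos, g_neg> / ||g_neg||^2.
    coeff = sum(p * n for p, n in zip(g_pos, g_neg)) / denom
    # Remove the component of g_pos that lies along g_neg.
    return [p - coeff * n for p, n in zip(g_pos, g_neg)]


def dot(a, b):
    """Plain dot product, used here only to check orthogonality."""
    return sum(x * y for x, y in zip(a, b))


# Toy example: the shared direction (first axis) is removed from g_pos.
g_pos = [3.0, 4.0]
g_neg = [1.0, 0.0]
g_ref = orthogonal_component(g_pos, g_neg)
```

In this toy case the refined gradient keeps only the part of g_pos that g_neg does not share, which is the sense in which the update discards attributes common to positive and negative data. In an actual training loop the same operation would presumably be applied to the gradients of the trainable parameters before the optimizer step.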
