Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss
James Baker
TL;DR
The paper addresses how to make text-to-image generation creative without labeled data or a dedicated, trained style classifier. It introduces a diffusion-based framework that optimizes a style-ambiguity reward through a DDPO formulation, using multimodal foundation models and clustering to build the classifiers (CLIP-based and K-Means-based) that steer creative outputs. The reward R(x) = -CE(C(x), U), where C(x) is the classifier's predicted style distribution for image x and U is the uniform distribution over styles, is highest when the prediction is close to uniform, so it rewards images that cannot be confidently assigned to a single style. Key contributions include style-ambiguity losses that require no trained classifier and that improve on automated metrics of human judgment compared with GAN-based baselines, and a demonstration of these methods across multiple image resolutions with Stable Diffusion-2 as the base model. The approach reduces labeling overhead, leverages multimodal representations, and may extend to creative generation in other modalities and domains.
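As a rough illustration of the reward R(x) = -CE(C(x), U), the sketch below computes it from a vector of style probabilities. It assumes the uniform distribution U is the target of the cross-entropy and that C(x) is already a normalized probability vector; the function name and the `eps` smoothing term are hypothetical and not taken from the paper.

```python
import numpy as np

def style_ambiguity_reward(probs, eps=1e-12):
    """Sketch of R(x) = -CE(C(x), U), assuming U (uniform) is the target."""
    probs = np.asarray(probs, dtype=np.float64)
    k = probs.shape[-1]
    uniform = np.full(k, 1.0 / k)
    # CE(C(x), U) = -sum_i U_i * log C(x)_i; eps guards against log(0).
    cross_entropy = -np.sum(uniform * np.log(probs + eps), axis=-1)
    return -cross_entropy

# An ambiguous (uniform) prediction earns a higher reward than a confident one.
ambiguous = style_ambiguity_reward([0.25, 0.25, 0.25, 0.25])
confident = style_ambiguity_reward([0.97, 0.01, 0.01, 0.01])
print(ambiguous > confident)
```

Under this convention the reward is maximized, at -log K for K styles, exactly when the predicted distribution is uniform.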
Abstract
Teaching text-to-image models to be creative involves using style ambiguity loss, which requires a pretrained classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that does not require training a classifier or even a labeled dataset. We then train a diffusion model to maximize style ambiguity, imbuing it with creativity, and find that our new methods improve upon the traditional method on automated metrics for human judgment while still maintaining creativity and novelty.
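One way a style distribution could be obtained without training a classifier is to compare an image embedding against a set of style centroids. The sketch below is only an assumption about how such a classifier might look: it turns Euclidean distances to the centroids into probabilities with a softmax. The function name, the `temperature` parameter, and the toy two-dimensional vectors are hypothetical; in practice the embeddings would come from a multimodal model such as CLIP and the centroids from a clustering step such as K-Means.

```python
import numpy as np

def centroid_style_probs(embedding, centroids, temperature=1.0):
    """Hypothetical training-free classifier: softmax over negative distances."""
    embedding = np.asarray(embedding, dtype=np.float64)
    centroids = np.asarray(centroids, dtype=np.float64)
    # Euclidean distance from the embedding to each centroid.
    dists = np.linalg.norm(centroids - embedding, axis=1)
    logits = -dists / temperature
    logits = logits - logits.max()  # stabilize the softmax numerically
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy centroids standing in for style clusters (placeholder values).
toy_centroids = np.array([[0.0, 0.0], [2.0, 0.0]])
near_first = centroid_style_probs([0.0, 0.0], toy_centroids)
midway = centroid_style_probs([1.0, 0.0], toy_centroids)
print(near_first, midway)
```

An embedding equidistant from all centroids yields a uniform distribution, which is the maximally ambiguous case under a style ambiguity objective.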
