Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss

James Baker

TL;DR

The paper tackles enabling creativity in text-to-image generation without labeled data or a specially trained style classifier. It introduces a diffusion-based framework that optimizes a style-ambiguity reward through a DDPO formulation, using classifiers built from a multimodal foundation model (CLIP-based) and from clustering (K-Means-based) to steer creative outputs. Key contributions include style-ambiguity losses that need no trained classifier and that improve automated metrics of human judgment compared with GAN-based baselines, and a demonstration of these methods across multiple image resolutions with Stable Diffusion-2 as the base model. The approach reduces labeling overhead, leverages multimodal representations, and suggests pathways toward cross-modal creative generation and other domains. The reward R(x) = -CE(C(x), U), where C(x) is the classifier's predicted style distribution for image x and U is the uniform distribution over styles, is highest when C(x) is close to uniform, so it rewards images that cannot be confidently assigned to any single style.
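
As a rough illustration of the reward R(x) = -CE(C(x), U), the sketch below computes it for a given vector of style probabilities. It assumes the cross-entropy takes the uniform distribution as the target, i.e. CE = -sum_k U_k log C_k(x); the summary does not spell out the argument order, but the opposite order would give a constant. The function name `style_ambiguity_reward` and the example probabilities are hypothetical.

```python
import math

def style_ambiguity_reward(style_probs, eps=1e-12):
    """Return R = -CE(C(x), U) for a list of style probabilities C(x).

    Assumption: the uniform distribution U is the target, so
    CE = -sum_k (1/K) * log C_k(x). eps guards against log(0).
    """
    k = len(style_probs)
    uniform = 1.0 / k
    cross_entropy = -sum(uniform * math.log(max(p, eps)) for p in style_probs)
    return -cross_entropy

# Hypothetical classifier outputs over four styles.
ambiguous = [0.25, 0.25, 0.25, 0.25]   # no single style dominates
confident = [0.97, 0.01, 0.01, 0.01]   # clearly one style

print(style_ambiguity_reward(ambiguous))
print(style_ambiguity_reward(confident))
```

Under this reading the reward peaks at -log K for a perfectly uniform prediction and drops as the classifier becomes more confident about one style.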

Abstract

Teaching text-to-image models to be creative involves using style ambiguity loss, which requires a pretrained classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that does not require training a classifier or even a labeled dataset. We then train a diffusion model to maximize style ambiguity to imbue the diffusion model with creativity and find our new methods improve upon the traditional method, based on automated metrics for human judgment, while still maintaining creativity and novelty.
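
To make the idea of a classifier that needs no training concrete, here is a minimal sketch of one plausible construction: image embeddings are compared against a set of reference points (for example, K-Means centroids of unlabeled image embeddings), and a softmax over negative distances gives a style distribution. This page does not state the exact mapping the paper uses, so the distance-to-softmax rule, the `temperature` parameter, and the names `softmax` and `centroid_style_probs` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def centroid_style_probs(embedding, centroids, temperature=1.0):
    """Turn distances to cluster centroids into a probability vector.

    Assumption: closer centroids get higher probability via a softmax
    over negative Euclidean distance. No labels or training are needed;
    the centroids would come from clustering unlabeled embeddings.
    """
    scores = [-math.dist(embedding, c) / temperature for c in centroids]
    return softmax(scores)

# Hypothetical 2-D embeddings and two centroids.
centroids = [[0.0, 0.0], [2.0, 0.0]]
print(centroid_style_probs([1.0, 0.0], centroids))  # equidistant from both
print(centroid_style_probs([0.0, 0.0], centroids))  # sits on the first centroid
```

An embedding that is equally far from every centroid yields a uniform distribution, which is the case a style-ambiguity objective would favor.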

Paper Structure

This paper contains 28 sections, 24 equations, 8 figures, 17 tables.

Figures (8)

  • Figure 1: Generator Architecture (Image Dim 512)
  • Figure 2: Generator Architecture (Image Dim 256)
  • Figure 3: Generator Architecture (Image Dim 128)
  • Figure 4: Generator Architecture (Image Dim 64)
  • Figure 5: Discriminator Architecture (Image Dim 512)
  • ...and 3 more figures