Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss

James Baker

TL;DR

The paper addresses creativity in text-to-image generation without labeled data or a separately trained style classifier. It fine-tunes a diffusion model, with Stable Diffusion-2 as the base, to maximize a style-ambiguity reward under a DDPO formulation. The style classifier C is built from multimodal foundation models and clustering (a CLIP-based variant and a K-Means-based variant) rather than trained on labels. The reward R(x) = -CE(C(x), U) is the negative cross-entropy between the classifier's style prediction C(x) and the uniform distribution U, so it is highest when the predicted style distribution is close to uniform, that is, when the image cannot be confidently assigned to any single style. The paper reports that these style-ambiguity losses, which need no trained classifier, improve on GAN-based baselines according to automated metrics that approximate human judgment, and it evaluates them at multiple image resolutions. The approach reduces labeling overhead and may generalize to creative generation in other modalities and domains.
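As a rough illustration of the reward R(x) = -CE(C(x), U), the sketch below computes the negative cross-entropy for a given vector of style probabilities. It assumes the uniform distribution is the target, i.e. CE = -sum_i U_i log C(x)_i; the summary does not spell out the argument order, and the function name `style_ambiguity_reward` and the `eps` smoothing term are hypothetical choices made for this example.

```python
import numpy as np

def style_ambiguity_reward(probs, eps=1e-12):
    """Negative cross-entropy between a uniform target U and style probabilities C(x).

    Assumption: CE is taken as -sum_i U_i * log C(x)_i. `eps` is a hypothetical
    smoothing constant that avoids log(0); it is not taken from the paper.
    """
    probs = np.asarray(probs, dtype=np.float64)
    num_styles = probs.shape[-1]
    uniform = np.full(num_styles, 1.0 / num_styles)
    cross_entropy = -np.sum(uniform * np.log(probs + eps), axis=-1)
    return -cross_entropy

# A near-uniform prediction scores higher than a confident one.
ambiguous = style_ambiguity_reward([0.25, 0.25, 0.25, 0.25])
confident = style_ambiguity_reward([0.97, 0.01, 0.01, 0.01])
```

Under this assumption the reward is bounded above by -log K for K styles, and it reaches that bound only when the predicted distribution is exactly uniform.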

Abstract

Teaching text-to-image models to be creative involves using style ambiguity loss, which requires a pretrained classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that does not require training a classifier or even a labeled dataset. We then train a diffusion model to maximize style ambiguity to imbue the diffusion model with creativity and find our new methods improve upon the traditional method, based on automated metrics for human judgment, while still maintaining creativity and novelty.
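One way a clustering-based classifier can avoid both labels and classifier training is to treat cluster centroids in an embedding space as style prototypes. The sketch below is an assumption-laden illustration, not the paper's implementation: it assumes centroids have already been obtained (for example, by K-Means over image embeddings from a pretrained model) and converts distances to those centroids into a probability vector with a softmax. The function name `centroid_style_probs`, the Euclidean distance, and the `temperature` parameter are all hypothetical.

```python
import numpy as np

def centroid_style_probs(embedding, centroids, temperature=1.0):
    """Map an embedding to pseudo-style probabilities via distances to centroids.

    Assumption: probabilities are a softmax over negative Euclidean distances.
    The paper's actual distance-to-probability mapping is not shown here.
    """
    embedding = np.asarray(embedding, dtype=np.float64)
    centroids = np.asarray(centroids, dtype=np.float64)
    distances = np.linalg.norm(centroids - embedding, axis=1)
    logits = -distances / temperature
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy 2-D example with two centroids.
toy_centroids = [[1.0, 0.0], [-1.0, 0.0]]
equidistant = centroid_style_probs([0.0, 0.0], toy_centroids)
near_first = centroid_style_probs([0.9, 0.0], toy_centroids)
```

An embedding that is equidistant from all centroids yields a uniform vector, which is the most ambiguous case for a style-ambiguity objective.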

Paper Structure (28 sections, 24 equations, 8 figures, 17 tables)

Introduction
Related Work
Creativity
Computational Art
Reinforcement Learning
Method
Model
Creative Adversarial Network
Diffusion
Markov Decision Processes
Denoising Diffusion Proximal Optimisation
Reward Function
Data
Choice of Classifier
DCGAN-Based Classifier
...and 13 more sections

Figures (8)

Figure 1: Generator Architecture (Image Dim 512)
Figure 2: Generator Architecture (Image Dim 256)
Figure 3: Generator Architecture (Image Dim 128)
Figure 4: Generator Architecture (Image Dim 64)
Figure 5: Discriminator Architecture (Image Dim 512)
...and 3 more figures