SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation

Zhixuan Liu; Peter Schaldenbrand; Beverley-Claire Okogwu; Wenxuan Peng; Youngsik Yun; Andrew Hundt; Jihie Kim; Jean Oh

TL;DR

Current generative image models trained on large web-scale data propagate cultural stereotypes and misrepresentations. The authors address this by introducing CCUB, a small, culturally representative dataset collected by communities, and SCoFT, a self-contrastive fine-tuning framework that leverages the model's own biases to correct high-level cultural representations. SCoFT combines a latent-diffusion loss (L_LDM), a memorization penalty (L_M), a decoded-space perceptual loss (L_P), and a Self-Contrastive Perceptual Loss (L_C) to shift generation away from biased priors while remaining data-efficient; it uses a guided negative set via ControlNet-depth to form a triplet objective. In a human study with 51 participants across five cultures, SCoFT substantially reduces offensiveness and increases cultural relevance, with the SCoFT+MPC variant consistently ranking highest across evaluation criteria. The work demonstrates a practical approach to equitable image generation and highlights the importance of curated cultural data and perceptual, contrastive training for responsible AI deployment.
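The triplet objective mentioned above can be illustrated with a minimal sketch. It assumes perceptual features have already been extracted as plain vectors. The anchor is the image generated by the model being fine-tuned, the positives are CCUB images, and the negatives are images from the non-fine-tuned Stable Diffusion model. The hinge form, the Euclidean distance, the `margin` value, and the function names are illustrative assumptions, not the authors' exact formulation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_perceptual_loss(
    anchor: Vector,
    positives: Sequence[Vector],
    negatives: Sequence[Vector],
    margin: float = 1.0,  # hypothetical value, not taken from the paper
) -> float:
    """Hinge-style triplet loss averaged over all (positive, negative) pairs.

    The loss is zero once the anchor is closer to every positive than to
    every negative by at least `margin`.
    """
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    total = 0.0
    for pos in positives:
        d_pos = l2_distance(anchor, pos)
        for neg in negatives:
            d_neg = l2_distance(anchor, neg)
            total += max(0.0, d_pos - d_neg + margin)
    return total / (len(positives) * len(negatives))


# Toy 2-D "features": the anchor coincides with the positive example and
# sits far from the negative one, so no penalty is applied.
loss_good = triplet_perceptual_loss([0.0, 0.0], [[0.0, 0.0]], [[3.0, 4.0]])

# Swapping the roles puts the anchor on the negative example, which is
# penalized.
loss_bad = triplet_perceptual_loss([0.0, 0.0], [[3.0, 4.0]], [[0.0, 0.0]])
```

In this toy setup, minimizing the loss pulls the generated image's features toward the curated data and pushes them away from the pretrained model's own outputs, which is the sense in which the objective is self-contrastive.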

Abstract

Accurate representation in media is known to improve the well-being of the people who consume it. Generative image models trained on large web-crawled datasets such as LAION are known to produce images with harmful stereotypes and misrepresentations of cultures. We improve inclusive representation in generated images by (1) engaging with communities to collect a culturally representative dataset that we call the Cross-Cultural Understanding Benchmark (CCUB) and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT) method that leverages the model's known biases to self-improve. SCoFT is designed to prevent overfitting on small datasets, encode only high-level information from the data, and shift the generated distribution away from misrepresentations encoded in a pretrained model. Our user study, conducted with 51 participants from 5 different countries grouped by their self-selected national cultural affiliation, shows that fine-tuning on CCUB consistently generates images with higher cultural relevance and fewer stereotypes than the Stable Diffusion baseline, and that our SCoFT technique improves these results further.

Paper Structure (24 sections, 5 equations, 25 figures, 5 tables, 1 algorithm)

Introduction
Related Work
CCUB Dataset
Method
Latent Diffusion Model Loss
Memorization Loss
Perceptual Loss
Self-Contrastive Perceptual Loss
Experiments
Results
Conclusion
SCoFT Method
SCoFT Pseudocode
Training Details
CCUB Dataset Scale
...and 9 more sections

Figures (25)

Figure 1: Comparison between Stable Diffusion with and without our proposed fine-tuning approach, SCoFT, on our proposed CCUB dataset. Stable Diffusion perpetuates harmful stereotypes that assume dirty buildings are representative of some nations, and often generates regionally irrelevant designs. By contrast, our approach decreases stereotypes and improves cultural relevance of generated images.
Figure 2: Sample cultural images and captions from our proposed CCUB dataset.
Figure 3: SCoFT Overview. A conventional fine-tuning loss, $\mathcal{L}_{LDM}$, and memorization penalty loss, $\mathcal{L}_{M}$, are computed in the Stable Diffusion latent space using images and captions from our CCUB dataset. After 20 denoising steps, the latent space is decoded. Perceptual features are extracted from the generated image and compared contrastively against CCUB images as positive examples and non-fine-tuned Stable Diffusion images as negative examples to form our Self-Contrastive Perceptual Loss, ${\mathcal{L}_{C}}$.
Figure 4: Qualitative comparison of our SCoFT model ablated and compared to Stable Diffusion without fine-tuning.
Figure 5: Violin plot of participant rankings across the survey items and countries. A wider strip means more answers with that value. Each new loss in our ablation study improved the rankings, and SCoFT+MPC is best. (Rank 1 is the best; 4, the worst)
...and 20 more figures
