D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models
Javon Hickmon
TL;DR
The paper tackles demographic bias in zero-shot image classification using a multimodal model (CLIP) and a generative model (Stable Diffusion XL). It introduces Diverse Demographic Data Generation (D3G), a training-free, inference-time framework that generates images from demographically diverse prompts and combines their image embeddings with text embeddings via a weighted sum to boost accuracy. Evaluated on the IdenProf dataset with multiple prompting strategies, D3G achieves consistent top-1 accuracy improvements, especially for underrepresented demographics, and offers insights into weight allocation and per-class behavior. The work also discusses assumptions, limitations, and future directions, and highlights the practical value of balancing demographic representation in large-scale multimodal systems without additional training.
Abstract
Image classification is essential for machine perception to achieve human-level image understanding. Multimodal models such as CLIP perform well on this task by learning semantic similarities across vision and language; despite these advances, however, image classification remains challenging. Models with low capacity often underfit and thus underperform on fine-grained image classification. It is also important to ensure high-quality data with rich cross-modal representations of each class, which are often difficult to generate. When datasets do not enforce balanced demographics, predictions will be biased toward the more represented class, while others will be neglected. We focus on how these issues can lead to harmful bias in zero-shot image classification and explore how to mitigate the resulting demographic bias. We propose Diverse Demographic Data Generation (D3G), a training-free, zero-shot method for boosting classification accuracy while reducing demographic bias in pre-trained multimodal models. We use CLIP as our base multimodal model and Stable Diffusion XL as our generative model. We demonstrate that providing diverse demographic data at inference time improves performance for these models, and we explore the impact of individual demographics on the resulting accuracy.
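The weighted sum of image and text embeddings mentioned in the TL;DR can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the embeddings have already been computed (random NumPy arrays stand in for CLIP outputs), that the generated-image embeddings for each class are averaged before fusion, and that a single scalar `text_weight` balances the two modalities. The function names, the averaging step, and the weight convention are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def fuse_class_embeddings(text_emb, gen_image_emb, text_weight=0.5):
    """Combine text and generated-image embeddings per class.

    text_emb:      (C, d) one text embedding per class
    gen_image_emb: (C, K, d) K generated-image embeddings per class
    text_weight:   hypothetical scalar in [0, 1]; the image side gets 1 - text_weight
    """
    text_emb = l2_normalize(text_emb)
    # Assumption: average the K demographic variants into one vector per class.
    image_mean = l2_normalize(gen_image_emb).mean(axis=1)
    fused = text_weight * text_emb + (1.0 - text_weight) * image_mean
    return l2_normalize(fused)


def classify(query_emb, class_emb):
    """Return the index of the class with the highest cosine similarity."""
    scores = class_emb @ l2_normalize(query_emb)
    return int(np.argmax(scores)), scores


# Random stand-ins for CLIP embeddings (3 classes, 4 generated images each).
rng = np.random.default_rng(0)
num_classes, num_generated, dim = 3, 4, 8
text_emb = rng.normal(size=(num_classes, dim))
gen_image_emb = rng.normal(size=(num_classes, num_generated, dim))

class_emb = fuse_class_embeddings(text_emb, gen_image_emb, text_weight=0.6)
query = gen_image_emb[1].mean(axis=0)  # a query resembling class 1's images
pred, scores = classify(query, class_emb)
```

In this sketch, setting `text_weight=1.0` reduces to ordinary text-only zero-shot classification, so the weight can be read as a dial for how much the generated demographic images influence each class representation.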
