Diminishing Stereotype Bias in Image Generation Model using Reinforcement Learning Feedback
Xin Chen, Virgile Foussereau
TL;DR
This work tackles gender stereotype bias in diffusion-based image generation by introducing a Reinforcement Learning from Artificial Intelligence Feedback (RLAIF) pipeline built on Denoising Diffusion Policy Optimization (DDPO). A pretrained Stable Diffusion model is fine-tuned with bias-aware rewards computed from the output of a gender-classification Transformer: $R_{shift}$, which shifts gender representation, and $R_{balance}$, which then balances it, without extra data or prompt changes. Empirical results demonstrate rapid bias shifts and eventual gender balance while maintaining image quality; an alternative trust-region approach yields only limited gains, owing to difficulties in estimating the KL divergence. The approach lays a foundation for extending bias mitigation to other forms of bias and to other prompting configurations, contributing to broader responsible AI development.
Abstract
This study addresses gender bias in image generation models using Reinforcement Learning from Artificial Intelligence Feedback (RLAIF) with a novel Denoising Diffusion Policy Optimization (DDPO) pipeline. By employing a pretrained Stable Diffusion model and a highly accurate gender classification Transformer, the research introduces two reward functions: $R_{shift}$ for shifting gender imbalances, and $R_{balance}$ for achieving and maintaining gender balance. Experiments demonstrate the effectiveness of this approach in mitigating bias without compromising image quality or requiring additional data or prompt modifications. While focusing on gender bias, this work establishes a foundation for addressing various forms of bias in AI systems, emphasizing the need for responsible AI development. Future research directions include extending the methodology to other bias types, enhancing the RLAIF pipeline's robustness, and exploring multi-prompt fine-tuning to further advance fairness and inclusivity in AI.
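To make the roles of the two rewards concrete, the following is a minimal illustrative sketch, not the definitions used in this work. It assumes the gender classifier returns, for each generated image, a probability `p_target` that the image depicts the under-represented gender. The function names `shift_reward` and `balance_rewards`, the 0.5 decision threshold, and the scaling constants are all hypothetical choices made for illustration.

```python
from typing import List


def shift_reward(p_target: float) -> float:
    """Hypothetical R_shift-style reward (an assumption, not the paper's formula).

    Maps the classifier probability for the target gender from [0, 1]
    to [-1, 1], so images of the target gender are rewarded.
    """
    return 2.0 * p_target - 1.0


def balance_rewards(p_target_batch: List[float]) -> List[float]:
    """Hypothetical R_balance-style reward (an assumption, not the paper's formula).

    Computes the fraction of the batch classified as the target gender
    (threshold 0.5) and rewards each image for belonging to whichever
    class is currently under-represented in the batch. A perfectly
    balanced batch yields zero reward for every image.
    """
    if not p_target_batch:
        return []
    n = len(p_target_batch)
    ratio = sum(1 for p in p_target_batch if p >= 0.5) / n
    imbalance = ratio - 0.5  # > 0 means the target class is over-represented
    rewards = []
    for p in p_target_batch:
        label_sign = 1.0 if p >= 0.5 else -1.0
        # Penalize images of the over-represented class, reward the other class.
        rewards.append(-2.0 * imbalance * label_sign)
    return rewards


if __name__ == "__main__":
    print(shift_reward(0.9))
    print(balance_rewards([0.9, 0.9, 0.9, 0.1]))
```

Under these assumptions, a shift-style reward keeps pushing generation toward one class regardless of the current ratio, whereas a balance-style reward vanishes once a batch is evenly split, which is consistent with the two-stage "shift, then balance" description above.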
