Understanding the Limitations of Diffusion Concept Algebra Through Food
E. Zhixuan Zeng, Yuhao Chen, Alexander Wong
TL;DR
This study investigates the limitations of Concept Algebra in diffusion-based image generation when applied to food imagery, a domain characterized by multi-object scenes and regional biases. It formalizes a style/content decomposition with style set $\mathcal{Z}$ and content set $\mathcal{W}$, and compares transformations using the edit rule $s_{edit} = (\mathbb{I} - \text{proj}_z)s[x_{orig}] + \text{proj}_z s[x_{new}]$, where $\text{proj}_z$ denotes projection onto the style subspace $\mathcal{R}_Z$. Using subspace visualization, the Jensen-Shannon distance $JS(\cdot\|\cdot)$, and silhouette scores, the work reveals entanglement between concepts, a strong dependence of causal separability on prompt wording, and a bias toward texture changes in food edits. These findings provide concrete metrics to guide prompt engineering and underscore the need for diverse testing domains to better understand biases in latent diffusion models.
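To make the edit rule concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the score representations $s[x_{orig}]$ and $s[x_{new}]$ can be treated as ordinary vectors and that the style subspace $\mathcal{R}_Z$ is spanned by a small set of given vectors. The function names (`orthonormal_basis`, `project`, `concept_edit`) and the toy 3-dimensional values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = sqrt(dot(w, w))
        if norm > tol:  # skip vectors that are (numerically) dependent
            basis.append([wi / norm for wi in w])
    return basis


def project(basis, s):
    """Orthogonal projection of s onto the span of an orthonormal basis."""
    out = [0.0] * len(s)
    for b in basis:
        c = dot(s, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def concept_edit(basis, s_orig, s_new):
    """Apply s_edit = (I - proj_z) s_orig + proj_z s_new."""
    p_orig = project(basis, s_orig)
    p_new = project(basis, s_new)
    return [so - po + pn for so, po, pn in zip(s_orig, p_orig, p_new)]


# Toy example (assumed values): the style subspace is the first coordinate axis.
style_vectors = [[2.0, 0.0, 0.0]]
basis = orthonormal_basis(style_vectors)

s_orig = [1.0, 2.0, 3.0]   # stand-in for s[x_orig]
s_new = [5.0, -1.0, 0.0]   # stand-in for s[x_new]

s_edit = concept_edit(basis, s_orig, s_new)
print(s_edit)  # [5.0, 2.0, 3.0]
```

In this toy case the edited vector takes its style component (first coordinate) from `s_new` and keeps the remaining components of `s_orig`. This is the behavior the edit rule is meant to achieve when style and content are cleanly separable. The entanglement discussed in this work corresponds to cases where that separation does not hold.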
Abstract
Image generation techniques, particularly latent diffusion models, have exploded in popularity in recent years. Many techniques have been developed to manipulate and clarify the semantic concepts these large-scale models learn, offering crucial insights into biases and concept relationships. However, these techniques are often validated only in conventional domains such as human or animal faces and artistic style transitions. The food domain poses unique challenges through complex compositions and regional biases, which can shed light on the limitations of, and opportunities within, existing methods. Through the lens of food imagery, we analyze both qualitative and quantitative patterns within a concept traversal technique. We report measurable insights into the model's ability to capture and represent the nuances of culinary diversity, and we identify areas where the model's biases and limitations emerge.
