Food Image Generation on Multi-Noun Categories
Xinyue Pan, Yuhao Chen, Jiangpeng He, Fengqing Zhu
TL;DR
This work tackles the challenge of generating food images for multi-noun compound prompts in diffusion-based models. It introduces FoCULR, combining FDALA (food-domain local alignment) to fine-tune attention and align patch-level concepts with category semantics, and CFIG (core-focused image generation) to impose head-noun–driven layouts via early negative prompts. By jointly optimizing $L_{align}$ and $L_{ga}$ during fine-tuning and using head-noun guidance during inference, FoCULR improves object coherence and reduces redundant components on VFN and UEC-256, with ablations showing complementary gains and some generalization to non-food domains. While effective, the approach may still face occasional object entanglement in complex scenes, pointing to future work on enhancing spatial relationships between multiple items in a single image.
Abstract
Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt "egg noodle" may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. Two factors contribute: the text encoder has insufficient knowledge of multi-noun categories, and the model misinterprets the relationships among the nouns, which leads to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement), which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that integrating these techniques improves image generation performance in the food domain.
