Food Image Generation on Multi-Noun Categories

Xinyue Pan, Yuhao Chen, Jiangpeng He, Fengqing Zhu

TL;DR

This work tackles the challenge of generating food images for multi-noun compound prompts in diffusion-based models. It introduces FoCULR, combining FDALA (food-domain local alignment) to fine-tune attention and align patch-level concepts with category semantics, and CFIG (core-focused image generation) to impose head-noun–driven layouts via early negative prompts. By jointly optimizing $L_{align}$ and $L_{ga}$ during fine-tuning and using head-noun guidance during inference, FoCULR improves object coherence and reduces redundant components on VFN and UEC-256, with ablations showing complementary gains and some generalization to non-food domains. While effective, the approach may still face occasional object entanglement in complex scenes, pointing to future work on enhancing spatial relationships between multiple items in a single image.

Abstract

Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt "egg noodle" may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. This stems from two problems: the text encoder has insufficient knowledge of multi-noun categories, and the model misinterprets the relationships between the nouns, which leads to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement), which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that integrating these techniques improves image generation performance in the food domain.

Paper Structure

This paper contains 18 sections, 7 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Example multi-noun food images generated by Stable Diffusion v1-4, with corresponding reference images, for selected categories.
  • Figure 2: Overview of our method, FoCULR. During the fine-tuning phase, the Stable Diffusion UNet is initialized with weights pretrained on the LAION-5B dataset, and the CLIP model (Radford et al., 2021) with weights pretrained on 400 million image-text pairs. The loss function combines the image-concept alignment loss $L_{align}$ and the reconstruction loss $L_{rec}$, so that both the UNet and the text encoder learn food domain knowledge. During inference, the negative prompt is activated only for denoising steps with $t<t_{threshold}$. The negative prompt is derived from the input prompt via GPT-4o head-noun identification. The denoising steps are scheduled according to DDPM (Denoising Diffusion Probabilistic Models; Ho et al., 2020).
  • Figure 3: Comparison of food image generation results against related work (Stable Diffusion, Structured Diffusion, SynGen, TextCraftor, and Stable Diffusion 3), with ablations of our components (CFIG, FDALA). Our method generates multi-noun food categories without redundant food objects, whereas prior work tends to generate irrelevant food objects.
  • Figure 4: Attention maps for the image generated with the prompt "A photo of a corn dog." and, when CFIG is applied, the negative prompt "corn".
  • Figure 5: Attention maps for the image generated with the prompt "A photo of a egg sandwich." and, when CFIG is applied, the negative prompt "egg".
  • ...and 4 more figures
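
The time-gated negative prompt described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`select_negative_prompt`, `negative_prompt_schedule`, `guided_noise`) are hypothetical, the noise predictions are plain lists of floats standing in for UNet outputs, and the guidance formula is the standard classifier-free guidance combination with the negative-prompt branch in place of the unconditional one. The gating condition `t < t_threshold` is taken literally from the caption; how `t` is indexed (step count or noise level) is not stated on this page and is left to the caller.

```python
from typing import List, Sequence, Tuple


def select_negative_prompt(t: int, t_threshold: int, negative_prompt: str) -> str:
    """Return the negative prompt while t < t_threshold, otherwise an empty prompt."""
    return negative_prompt if t < t_threshold else ""


def negative_prompt_schedule(
    timesteps: Sequence[int], t_threshold: int, negative_prompt: str
) -> List[Tuple[int, str]]:
    """Pair every timestep with the negative prompt that is active at that step."""
    return [(t, select_negative_prompt(t, t_threshold, negative_prompt)) for t in timesteps]


def guided_noise(
    eps_negative: Sequence[float], eps_prompt: Sequence[float], guidance_scale: float
) -> List[float]:
    """Classifier-free guidance: move away from the negative branch toward the prompt branch."""
    return [n + guidance_scale * (p - n) for n, p in zip(eps_negative, eps_prompt)]


if __name__ == "__main__":
    # Hypothetical values: four steps, gate at t_threshold = 2, negative prompt "corn".
    for t, neg in negative_prompt_schedule([0, 1, 2, 3], t_threshold=2, negative_prompt="corn"):
        print(t, repr(neg))
```

In a real pipeline, the string returned for each step would be encoded by the text encoder and fed to the negative branch of the UNet call for that step; when the gate is closed, the empty prompt reduces the update to ordinary classifier-free guidance.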