CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion
Yuan Wang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, Xiang Wang
TL;DR
CookingDiffusion introduces a memory-augmented diffusion framework for cooking procedural image generation: producing a sequence of images, one per recipe step, that stay consistent across the recipe. It adds three Memory Nets (Text Memory Net, Image Memory Net, and Multi-modal Memory Net) to Stable Diffusion to handle textual, visual, or mixed procedural prompts, and it is evaluated on a benchmark derived from YouCookII using FID and the proposed Average Procedure Consistency (Avg-PCon) metric. The authors report high image quality and consistency across sequential cooking steps, and show that ingredients and cooking methods in a recipe can be manipulated, which lays groundwork for interactive cooking simulations. They state that the code, models, and dataset will be made publicly accessible.
Abstract
Recent text-to-image generation models excel at creating diverse and realistic images. This success extends to food imagery, where conditional inputs such as cooking styles, ingredients, and recipes are used. However, generating a sequence of procedural images from the cooking steps of a recipe remains unexplored. Such a capability could enhance the cooking experience with visual guidance and possibly lead to an intelligent cooking simulation system. To fill this gap, we introduce a novel task called cooking procedural image generation. This task is inherently demanding, as it requires photo-realistic images that align with the cooking steps while preserving sequential consistency. To tackle these challenges jointly, we present CookingDiffusion, a novel approach that leverages Stable Diffusion and three Memory Nets to model procedural prompts. These prompts comprise text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), and they are modeled to ensure the consistent generation of cooking procedural images. To validate the effectiveness of our approach, we preprocess the YouCookII dataset to establish a new benchmark. Our experimental results demonstrate that our model generates high-quality cooking procedural images that are consistent across sequential cooking steps, as measured by both FID and the proposed Average Procedure Consistency metric. Furthermore, CookingDiffusion demonstrates the ability to manipulate ingredients and cooking methods in a recipe. We will make our code, models, and dataset publicly accessible.
