CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion

Yuan Wang; Bin Zhu; Yanbin Hao; Chong-Wah Ngo; Yi Tan; Xiang Wang

TL;DR

CookingDiffusion introduces a memory-augmented diffusion framework for cooking procedural image generation: producing a sequence of step-aligned images that remain consistent across a recipe. It adds three Memory Nets to Stable Diffusion (a Text Memory Net, an Image Memory Net, and a Multi-modality Memory Net) to process textual, visual, or mixed procedural prompts, and it is evaluated on a YouCookII-derived benchmark using FID and the proposed Average Procedure Consistency (Avg-PCon) metric. The approach is reported to achieve strong image fidelity and sequential coherence across these prompt settings and to support manipulation of ingredients and cooking methods, laying groundwork for interactive cooking simulations. The authors state that the code, models, and dataset will be made publicly accessible.

Abstract

Recent advancements in text-to-image generation models have excelled in creating diverse and realistic images. This success extends to food imagery, where various conditional inputs like cooking styles, ingredients, and recipes are utilized. However, a yet-unexplored challenge is generating a sequence of procedural images based on cooking steps from a recipe. This could enhance the cooking experience with visual guidance and possibly lead to an intelligent cooking simulation system. To fill this gap, we introduce a novel task called cooking procedural image generation. This task is inherently demanding, as it strives to create photo-realistic images that align with cooking steps while preserving sequential consistency. To collectively tackle these challenges, we present CookingDiffusion, a novel approach that leverages Stable Diffusion and three innovative Memory Nets to model procedural prompts. These prompts encompass text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), ensuring the consistent generation of cooking procedural images. To validate the effectiveness of our approach, we preprocess the YouCookII dataset, establishing a new benchmark. Our experimental results demonstrate that our model excels at generating high-quality cooking procedural images with remarkable consistency across sequential cooking steps, as measured by both the FID and the proposed Average Procedure Consistency metrics. Furthermore, CookingDiffusion demonstrates the ability to manipulate ingredients and cooking methods in a recipe. We will make our code, models, and dataset publicly accessible.
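
The task setup can be illustrated with a small sketch. As the Figure 1 caption describes it, the current step serves as the conditional prompt and the previous contextual steps serve as procedural prompts. The Python below only shows that pairing for text prompts; the names `StepPrompt` and `build_step_prompts` are hypothetical, and using all preceding steps (rather than a limited window) is an assumption, not something the paper summary specifies. It does not reproduce the Memory Nets or any part of the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepPrompt:
    """Prompts for generating the image of one cooking step (hypothetical container)."""
    conditional: str       # the current cooking step
    procedural: List[str]  # the preceding steps, in recipe order


def build_step_prompts(steps: List[str]) -> List[StepPrompt]:
    """Pair each step with the steps before it.

    Assumption: every earlier step is kept as context; the paper may use
    a different context selection.
    """
    return [
        StepPrompt(conditional=step, procedural=steps[:index])
        for index, step in enumerate(steps)
    ]


# Illustrative recipe text, not taken from the YouCookII benchmark.
recipe = ["Chop the onions.", "Fry the onions in oil.", "Add the beaten eggs."]
prompts = build_step_prompts(recipe)

for prompt in prompts:
    print(f"{len(prompt.procedural)} context step(s) -> {prompt.conditional}")
```

The first step has no procedural context, and each later step sees every step before it, which is the property a sequence-consistent generator would rely on.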

Paper Structure (23 sections, 3 equations, 9 figures, 3 tables)

This paper contains 23 sections, 3 equations, 9 figures, 3 tables.

Introduction
Related Work
Method
Problem Definition
Model Overview
Text Memory Net (TMN)
Image Memory Net (IMN)
Multi-modality Memory Net (MMN)
Improved baselines
StackGAN-based Method
VQ Diffusion-based Method
Control Net-based Method
Control Net with procedural text prompts
Control Net with procedural image prompts
Experiment
...and 8 more sections

Figures (9)

Figure 1: Task comparison of traditional text-to-image generation, recipe-to-image generation, and our proposed cooking procedural image generation. (a) Traditional text-to-image generation involves generating an image based on a textual description. (b) Recipe-to-image generation aims to generate the final dish image based on the entire recipe. (c) Our proposed cooking procedural image generation task aims to generate a sequence of consistent cooking images that correspond to each specific step in a recipe, where the current step (yellow blocks) is regarded as the conditional prompt and the previous contextual steps (blue blocks) as procedural prompts.
Figure 2: Overview of our proposed CookingDiffusion. We introduce three different Memory Nets for CookingDiffusion. (a) The Text Memory Net is tailored for processing text-based procedural prompts, while (b) the Image Memory Net is dedicated to handling image-based procedural prompts. (c) The Multi-modality Memory Net is introduced to deal with procedural prompts from different modalities.
Figure 3: Overview of Control Net with procedural text and image prompts. Various modifications are implemented on the Control Net to facilitate the learning and involvement of procedural prompts.
Figure 4: Projection network architectures. (a) The original projection network. (b) Temporal Projection Network A (TP-A). (c) Temporal Projection Network B (TP-B). TP-A and TP-B are the two proposed temporal projection networks. Activation functions are omitted for simplicity.
Figure 5: Comparison of procedural images generated by CookingDiffusion (with TMN and with IMN) and by Stable Diffusion.
...and 4 more figures
