CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, Wen-Huang Cheng

TL;DR

CookAnything tackles the challenge of generating coherent, multi-step recipe illustrations from variable-length text instructions. It introduces three innovations—Step-wise Regional Control (SRC) for per-step region binding, Flexible RoPE for step-aware positional encoding, and Cross-Step Consistency Control (CSCC) with a Cooking Agent—to maintain semantic disentanglement, temporal coherence, and ingredient consistency across steps. The framework achieves state-of-the-art performance in both training-based and training-free settings on RecipeGen and VGSI, with strong quantitative metrics and favorable human judgments. The work enables scalable, high-quality visual synthesis of procedural content and has broad potential for instructional media and content creation beyond cooking. Overall, CookAnything provides a unified, extensible approach to structured multi-image generation that aligns closely with textual procedural instructions and ingredient dynamics.

Abstract

Cooking is a sequential and visually grounded activity, where each step, such as chopping, mixing, or frying, carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods cannot adapt to the natural variability in recipe length: they generate a fixed number of images regardless of the actual structure of the instructions. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything outperforms existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.

Paper Structure

This paper contains 27 sections, 12 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overall structure of our CookAnything model, illustrated with a 3-step vegetable pancake recipe. The Cooking Agent reformats the raw recipe into context-tagged steps, supplementing missing ingredient details. Each step is encoded by a T5 Encoder in two ways: (1) all steps are concatenated to capture global context and produce contextual step tokens, and (2) each step is encoded independently to preserve local semantics and generate step tokens. These two types of tokens are fused via weighted averaging. Meanwhile, noisy latent tokens, processed by Flexible RoPE, are fed into DiT. A Step-wise Regional Attention Mask is applied during DiT’s self-attention to constrain attention within each step, ensuring step-wise focus and visual consistency. In the illustration, purple, green, and pink tokens represent Steps 1, 2, and 3, respectively.
  • Figure 2: Visual comparison between the original RoPE and our proposed Flexible RoPE, using the example of Lamb Pilaf. With the original RoPE, repeated step images appear as early as Step 2, Steps 3 and 6 exhibit positional misalignment, and Step 9 suffers from noticeable blurring. In contrast, Flexible RoPE maintains clear step-wise differentiation, stable spatial alignment, and improved visual sharpness throughout the cooking process.
  • Figure 3: Examples before and after applying Cross-Step Consistency Control (CSCC). Left: Stir-Fried Carrot with Dried Tofu. Without CSCC, the carrot changes from cubes to strips in Step 4. Visualization of contextual tokens (using Flux.1-dev) shows shape continuity is preserved, so CSCC helps maintain a consistent appearance. Right: Steamed Chicken Wings with Taro. In Step 5, taro should appear beneath the wings but disappears without CSCC. Since contextual tokens confirm its presence, CSCC successfully preserves it.
  • Figure 4: Qualitative comparisons. SKD refers to StackedDiffusion, and SD3.5 refers to Stable Diffusion 3.5. SD3.5, Flux.1-dev, and SKD all exhibit issues with ingredient accuracy, discontinuous ingredient shapes, and the generation of incorrect ingredients. In contrast, our model excels in maintaining the shape and continuity of ingredients.
  • Figure 5: Visualization of dishes from different regions.
  • ...and 6 more figures
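
The Figure 1 caption mentions two mechanisms that a small example may make more concrete: fusing contextual and independent step tokens by weighted averaging, and masking self-attention so that tokens attend only within their own step. The sketch below is a minimal pure-Python illustration, not the authors' implementation. The function names `stepwise_mask` and `fuse_tokens`, the weight `alpha`, and the token counts are all hypothetical, and the mask covers only the within-step constraint described in the caption.

```python
from typing import List


def stepwise_mask(tokens_per_step: List[int]) -> List[List[bool]]:
    """Build a block-diagonal attention mask (hypothetical helper).

    Entry [i][j] is True when token i and token j belong to the same
    recipe step, so attention is allowed only within a step.
    """
    # Map every token position to the index of the step that owns it.
    step_of = [s for s, n in enumerate(tokens_per_step) for _ in range(n)]
    return [[a == b for b in step_of] for a in step_of]


def fuse_tokens(
    contextual: List[List[float]],
    independent: List[List[float]],
    alpha: float = 0.5,  # assumed fusion weight; the source gives no value
) -> List[List[float]]:
    """Weighted average of two equally shaped token embedding lists."""
    if len(contextual) != len(independent):
        raise ValueError("token lists must have the same length")
    return [
        [alpha * c + (1.0 - alpha) * s for c, s in zip(c_row, s_row)]
        for c_row, s_row in zip(contextual, independent)
    ]


# Three steps owning 2, 1, and 2 tokens respectively (illustrative sizes).
mask = stepwise_mask([2, 1, 2])

# One token with a 2-dimensional embedding, fused with alpha = 0.25.
fused = fuse_tokens([[1.0, 0.0]], [[0.0, 1.0]], alpha=0.25)
```

Here `mask` allows the two tokens of Step 1 to attend to each other but not to the token of Step 2, and `fused` weights the contextual embedding by 0.25 and the independent one by 0.75. In a real DiT the mask would be applied to the attention logits before the softmax, and the token counts would follow from the latent resolution of each step's image region.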