Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices

Jigyasa Gupta, Soumya Goyal, Anil Kumar, Ishan Jindal

TL;DR

This work tackles real-time, on-device synthesis of cooked-food images conditioned on raw inputs, recipe, and desired doneness. It introduces an edge-efficient FiLM-conditioned U-Net generator guided by sinusoidal recipe-state embeddings and trained with a domain-specific Culinary Image Similarity (CIS) metric, enabling temporally coherent visual progression and a stopping signal for cooking. A novel oven-based progression dataset (1708 sessions, 30 recipes) supports evaluation, where the proposed method achieves state-of-the-art FID/LPIPS with about 8.68M parameters and real-time inference, while CIS provides a robust, on-device progress indicator and training signal. The results demonstrate practical potential for intelligent kitchen appliances, offering interpretable, user-preference-driven visual feedback and a foundation for broader multimodal cooking intelligence.
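As a rough illustration of the two conditioning ingredients named above, the sketch below encodes a scalar cooking state with a Transformer-style sinusoidal embedding and applies FiLM, a per-channel scale and shift, to a small feature map. The embedding dimension, the base frequency, and the function names `sinusoidal_embedding` and `film` are assumptions made for this example. In a trained model the FiLM parameters would be predicted from the embedding by learned layers, which are omitted here.

```python
import math


def sinusoidal_embedding(position, dim=8, base=10000.0):
    """Encode a scalar (e.g. a cooking-state index) as interleaved sin/cos values.

    Assumed form: the standard Transformer positional encoding.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    embedding = []
    for i in range(dim // 2):
        # Frequencies decay geometrically with the pair index i.
        freq = 1.0 / (base ** (2 * i / dim))
        embedding.append(math.sin(position * freq))
        embedding.append(math.cos(position * freq))
    return embedding


def film(features, gamma, beta):
    """Apply FiLM: y[c][k] = gamma[c] * x[c][k] + beta[c] for each channel c.

    features: list of channels, each a flat list of activations.
    gamma, beta: one scale and one shift per channel (hand-set here;
    predicted from the conditioning embedding in a real model).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


if __name__ == "__main__":
    state_embedding = sinusoidal_embedding(position=2, dim=8)
    print(len(state_embedding))
    print(film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[1.0, -1.0]))
```

Because FiLM only adds one scale and one shift per channel, the conditioning path adds few parameters, which is consistent with the small model size reported above.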

Abstract

Synthesizing realistic cooked food images from raw inputs on edge devices is a challenging generative task, requiring models to capture complex changes in texture, color, and structure during cooking. Existing image-to-image generation methods often produce unrealistic results or are too resource-intensive for edge deployment. We introduce the first oven-based cooking-progression dataset with chef-annotated doneness levels and propose an edge-efficient generator, guided by recipe and cooking state, that synthesizes realistic food images conditioned on a raw food image. This formulation enables user-preferred visual targets rather than fixed presets. To ensure temporal consistency and culinary plausibility, we introduce a domain-specific Culinary Image Similarity (CIS) metric, which serves both as a training loss and a progress-monitoring signal. Our model outperforms existing baselines with significant reductions in FID scores (30% improvement on our dataset; 60% on public datasets).
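To make the progress-monitoring role of a similarity score concrete, here is a minimal sketch of a stopping rule. It assumes each camera frame and the user's target image have already been mapped to embedding vectors by some learned network, which is not shown. The use of cosine similarity, the threshold value, and the name `first_stop_index` are assumptions for this example and are not taken from the paper's definition of CIS.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_stop_index(frame_embeddings, target_embedding, threshold=0.95):
    """Return the index of the first frame whose similarity to the target
    reaches the threshold, or None if no frame does.

    frame_embeddings: embeddings of successive oven-camera frames
    (hypothetical inputs; the embedding network is assumed to exist).
    """
    for index, embedding in enumerate(frame_embeddings):
        if cosine_similarity(embedding, target_embedding) >= threshold:
            return index
    return None


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy 2-D embeddings
    target = [0.0, 1.0]
    print(first_stop_index(frames, target, threshold=0.95))
```

In practice a deployed rule would likely smooth the score over several frames before stopping, so that a single noisy frame does not end the cook early.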

Paper Structure

This paper contains 17 sections, 10 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Given raw food images and text prompts (recipe name, cooking state), our model generates three visually distinguishable, realistic images for each cooking state showing cooking progression. Best viewed in color.
  • Figure 2: Overall generator architecture. The generator takes a raw image and context information (recipe name and cooking state) as input and generates a cooked image as output. The discriminator, employing a patch-based approach, evaluates both (raw, real-cooked) and (raw, generated-cooked) image pairs, providing the generator with an adversarial loss. Additionally, the generator incorporates perceptual losses by comparing real-cooked and generated-cooked images using LPIPS and the Culinary Image Similarity (CIS) loss. These three losses (adversarial, LPIPS, and CIS) are combined to optimize the generator during training.
  • Figure 3: Text-guided Conditioned U-Net architecture. Top: Complete U-Net with feature modulation at each layer. Bottom: Detailed modulated layers. Encoder layers $E_l$ use contextual embeddings $E_{p_i}$ (recipe name, cooking state) for FiLM-based feature modulation. Decoder layers $D_l$ apply identical modulation for consistent context-aware processing.
  • Figure 4: We learn a culinary image similarity metric, $\mathcal{F}_{cul}$, by leveraging temporal distances between image pairs from cooking sessions. The Siamese network, $f_{sim}$, is trained to map cooking stages into an embedding space, where temporal progression is represented as a smooth, continuous trajectory.
  • Figure 5: Generated image quality comparison across baseline methods and our proposed approach. The results demonstrate the performance of all models for a specific cooking state, illustrating that our proposed method produces the most plausible images. (Best viewed in color)
  • ...and 3 more figures
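
The Figure 4 caption describes learning a similarity metric from temporal distances between image pairs in a cooking session. One plausible way to turn that into a training target is sketched below: the embedding distance of a pair is regressed onto the pair's normalized time gap. This squared-error form, the normalization by session length, and all function names are assumptions for this example; the paper's actual loss may differ.

```python
import math


def normalized_temporal_distance(t_a, t_b, session_length):
    """Time gap between two frames as a fraction of the session length."""
    if session_length <= 0:
        raise ValueError("session_length must be positive")
    return abs(t_a - t_b) / session_length


def embedding_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def temporal_regression_loss(emb_a, emb_b, t_a, t_b, session_length):
    """Squared error between embedding distance and normalized time gap.

    Minimizing this pushes frames that are far apart in time to be far
    apart in embedding space, and nearby frames to be close together.
    """
    target = normalized_temporal_distance(t_a, t_b, session_length)
    return (embedding_distance(emb_a, emb_b) - target) ** 2


if __name__ == "__main__":
    # Two toy embeddings 0.5 apart, from frames 300 s apart in a 600 s session.
    print(temporal_regression_loss([0.0, 0.0], [0.3, 0.4], 0, 300, 600))
```

In a Siamese setup both embeddings would come from the same network with shared weights, and the loss would be averaged over many sampled pairs per session.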