Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices
Jigyasa Gupta, Soumya Goyal, Anil Kumar, Ishan Jindal
TL;DR
This work tackles real-time, on-device synthesis of cooked-food images conditioned on raw inputs, recipe, and desired doneness. It introduces an edge-efficient FiLM-conditioned U-Net generator guided by sinusoidal recipe-state embeddings and trained with a domain-specific Culinary Image Similarity (CIS) metric, enabling temporally coherent visual progression and a stopping signal for cooking. A novel oven-based progression dataset (1708 sessions, 30 recipes) supports evaluation, where the proposed method achieves state-of-the-art FID/LPIPS with about 8.68M parameters and real-time inference, while CIS provides a robust, on-device progress indicator and training signal. The results demonstrate practical potential for intelligent kitchen appliances, offering interpretable, user-preference-driven visual feedback and a foundation for broader multimodal cooking intelligence.
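To make the conditioning idea concrete, the following is a minimal sketch of how a scalar cooking state could be turned into a sinusoidal embedding and then used to modulate feature channels in FiLM style (a per-channel scale and shift). It is an illustration under stated assumptions, not the authors' implementation: the function names, the embedding dimension, the `max_period` constant, and the plain-list feature layout are all hypothetical. In a real generator the scale and shift would come from a learned projection of the embedding, which is omitted here.

```python
import math


def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Encode a scalar (e.g. a doneness level) as `dim` sin/cos features.

    Hypothetical helper: `dim` and `max_period` are illustrative choices.
    """
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down towards 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def film(features, gamma, beta):
    """FiLM modulation: scale and shift each channel.

    `features` is a list of channels, each a flat list of floats;
    `gamma` and `beta` hold one scale and one shift per channel.
    """
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


if __name__ == "__main__":
    state = sinusoidal_embedding(0.0, dim=4)
    print(state)  # [0.0, 0.0, 1.0, 1.0]

    # Two channels of two values each, with hand-picked scale and shift.
    out = film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[1.0, 0.0])
    print(out)  # [[3.0, 5.0], [1.5, 2.0]]
```

Because the modulation is only a per-channel multiply and add, it adds very little compute on top of the convolutional backbone, which is one plausible reason this style of conditioning suits edge deployment.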
Abstract
Synthesizing realistic cooked food images from raw inputs on edge devices is a challenging generative task, requiring models to capture complex changes in texture, color, and structure during cooking. Existing image-to-image generation methods often produce unrealistic results or are too resource-intensive for edge deployment. We introduce the first oven-based cooking-progression dataset with chef-annotated doneness levels and propose an edge-efficient generator, guided by the recipe and the cooking state, that synthesizes realistic food images conditioned on a raw food image. This formulation enables user-preferred visual targets rather than fixed presets. To ensure temporal consistency and culinary plausibility, we introduce a domain-specific \textit{Culinary Image Similarity (CIS)} metric, which serves both as a training loss and as a progress-monitoring signal. Our model outperforms existing baselines with significant reductions in FID scores (30\% improvement on our dataset; 60\% on public datasets).
