VisualChef: Generating Visual Aids in Cooking via Mask Inpainting

Oleh Kuzyk, Zuoyue Li, Marc Pollefeys, Xi Wang

TL;DR

VisualChef provides contextual visual guidance for cooking: given an initial frame $f_\text{in}$ and a specified action, it generates two output frames, $f_\text{action}$ and $f_\text{final}$. It does this with a mask-based diffusion pipeline that grounds action-relevant objects, classifies them into Core, Location, and Functional categories, and edits only those regions so that the rest of the scene stays consistent. A data-curation pipeline extracts initial-action-final triplets from egocentric videos, and a CLIP-based loss guides inpainting quality. Across Ego4D, EGTEA Gaze+, and EK-100, VisualChef outperforms state-of-the-art baselines in semantic alignment (CLIP-based metrics) and maintains competitive or superior image fidelity, with ablations confirming the benefits of targeted masking and per-task fine-tuning. The approach is relevant to instruction-driven cooking assistance and robotics, since it produces environment-consistent visual aids without heavy textual alignment or extensive annotations.

Abstract

Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action's execution and the resulting appearance of the object, while preserving the initial frame's environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.

Paper Structure

This paper contains 33 sections, 1 equation, 22 figures, 11 tables.

Figures (22)

  • Figure 1: Generating contextual action and final state frames via mask inpainting. Given an initial frame and an action, VisualChef generates two frames visualizing both the action's execution and the resulting appearance of the object while preserving the environment depicted in the input frame.
  • Figure 2: The VisualChef pipeline for context-aware inpainting within a cooking scenario. It starts with an Initial Frame ($f_\text{in}$) as input, paired with an Action description (e.g., "cut carrot"). The vision-language model LLaVA (Liu et al., 2023) is employed to identify relevant objects and classify them into three categories: Core objects (e.g., "carrot"), Location objects (e.g., "cutting board"), and Functional objects (e.g., "knife" and "hand"). Using the open-vocabulary segmentation model Grounding DINO (Liu et al., 2024), the masks for these objects are generated: (1) Core Masks, (2) Location Masks, and (3) Functional Masks. Core objects are Relocated in an additional step (4) as needed. The generation phase involves two different inpainting modules based on Stable Diffusion (Rombach et al., 2022), conditioned on different combinations of the masks for creating the Action Frame ($f_\text{action}$) that reflects the step being performed and the Final Frame ($f_\text{final}$) showing the status upon action completion. The output thus visualizes the progression of the cooking scenario in a realistic manner.
  • Figure 3: Chain-of-thought reasoning for relevant object identification. Given an input image, we use a chain-of-thought strategy to prompt LLaVA (Liu et al., 2023) for a categorized list of objects relevant to the given action.
  • Figure 4: Qualitative comparison with related work. VisualChef has the best performance in aligning generated images to the input action and preserving the environment compared to state-of-the-art methods.
  • Figure 5: Human evaluation comparing SOTA models with VisualChef. Users tend to select images generated by VisualChef more often.
  • ...and 17 more figures
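
The Figure 2 caption describes grouping object masks into Core, Location, and Functional categories and conditioning two inpainting modules on different combinations of them. The sketch below illustrates only that bookkeeping step. It is a minimal illustration under stated assumptions, not the authors' implementation: the names `GroundedObjects`, `union`, and `edited_fraction` are hypothetical, and the specific category combinations chosen for the action and final frames are guesses, since the caption does not say which masks feed which module.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

# H x W binary mask; True marks pixels that the inpainting step may edit.
Mask = List[List[bool]]

# The three object categories named in the Figure 2 caption.
CATEGORIES = ("core", "location", "functional")

# ASSUMPTION: the caption only says "different combinations of the masks".
# These two combinations are illustrative guesses, not taken from the paper.
ACTION_FRAME_CATEGORIES = ("core", "functional")
FINAL_FRAME_CATEGORIES = ("core", "location")


def empty_mask(height: int, width: int) -> Mask:
    """Return an all-False mask of the given size."""
    return [[False] * width for _ in range(height)]


def union(masks: Sequence[Mask], height: int, width: int) -> Mask:
    """Pixel-wise OR of several equally sized masks."""
    out = empty_mask(height, width)
    for mask in masks:
        for y in range(height):
            for x in range(width):
                out[y][x] = out[y][x] or mask[y][x]
    return out


@dataclass
class GroundedObjects:
    """Hypothetical container for per-category object masks of one frame."""

    height: int
    width: int
    masks: Dict[str, List[Mask]] = field(
        default_factory=lambda: {c: [] for c in CATEGORIES}
    )

    def add(self, category: str, mask: Mask) -> None:
        """Register one object mask under one of the three categories."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self.masks[category].append(mask)

    def combined(self, categories: Sequence[str]) -> Mask:
        """Merge all masks of the selected categories into one edit region."""
        selected = [m for c in categories for m in self.masks[c]]
        return union(selected, self.height, self.width)


def edited_fraction(mask: Mask) -> float:
    """Share of pixels that would be edited; the rest is left untouched."""
    total = sum(len(row) for row in mask)
    return sum(cell for row in mask for cell in row) / total


if __name__ == "__main__":
    objects = GroundedObjects(height=4, width=4)
    carrot = empty_mask(4, 4)
    carrot[1][1] = carrot[1][2] = True      # core object
    board = empty_mask(4, 4)
    board[2] = [True] * 4                   # location object
    knife = empty_mask(4, 4)
    knife[0][0] = True                      # functional object
    objects.add("core", carrot)
    objects.add("location", board)
    objects.add("functional", knife)
    print(edited_fraction(objects.combined(ACTION_FRAME_CATEGORIES)))
    print(edited_fraction(objects.combined(FINAL_FRAME_CATEGORIES)))
```

Keeping the edit region as an explicit union of category masks makes the "selective editing" idea concrete: every pixel outside the combined mask is passed through unchanged, which is what preserves the environment of the initial frame.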