OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking

Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, Patrick Carrington

TL;DR

OSCAR supports non-visual cooking by tracking object status changes, aligning video frames with recipe steps, and providing context-aware guidance. It combines recipe processing, object-status extraction, visual data alignment, and a time-causal model, using VLMs and LLMs to predict progress and answer contextual questions. On a large YouCook2-based dataset and a real-world non-visual cooking dataset, OSCAR improves substantially over baselines (e.g., on YouCook2, CLIP rises from 41.7% to 68.0% and SigLIP from 62.2% to 82.8%), showing that object-status information makes step tracking more robust. The work also introduces a new non-visual cooking dataset and discusses factors that affect performance in real-world settings, with practical considerations for accessibility-focused cooking assistance and future extensions to interfaces and datasets.

Abstract

Following recipes while cooking is an important but difficult task for visually impaired individuals. We developed OSCAR (Object Status Context Awareness for Recipes), a novel approach that provides recipe progress tracking and context-aware feedback on the completion of cooking tasks through tracking object statuses. OSCAR leverages both Large Language Models (LLMs) and Vision-Language Models (VLMs) to manipulate recipe steps, extract object status information, align visual frames with object status, and provide a cooking progress tracking log. We evaluated OSCAR's recipe-following functionality using 173 YouTube cooking videos and 12 real-world non-visual cooking videos to demonstrate OSCAR's capability to track cooking steps and provide contextual guidance. Our results highlight the effectiveness of using object status to improve performance compared to the baseline by over 20% across different VLMs, and we present factors that impact prediction performance. Furthermore, we contribute a dataset of real-world non-visual cooking videos with step annotations as an evaluation benchmark.
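
The alignment idea in the abstract can be illustrated with a small sketch. It is not OSCAR's implementation: the embeddings below are made-up toy vectors standing in for VLM outputs (frame embeddings, and text embeddings of each step with its object status), and `track_steps` with its `max_advance` parameter is a hypothetical, simplified stand-in for a time-causal constraint, assuming only that progress moves forward through the recipe.

```python
import numpy as np

def cosine_similarity(frames, steps):
    """Cosine similarity between every frame and every step embedding."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    s = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    return f @ s.T  # shape: (num_frames, num_steps)

def track_steps(sim, max_advance=1):
    """Greedy forward-only tracking (an assumption, not the paper's model).

    Each frame may stay on the current step or advance by at most
    `max_advance` steps; it can never move backward.
    """
    current = 0
    preds = []
    for row in sim:
        hi = min(current + max_advance, len(row) - 1)
        window = row[current:hi + 1]
        current += int(np.argmax(window))  # ties keep the current step
        preds.append(current)
    return preds

# Toy data: 3 recipe steps and 5 frames in a 3-dimensional embedding space.
steps = np.eye(3)
frames = np.array([
    [1.0, 0.0, 0.0],  # looks like step 0
    [0.9, 0.1, 0.0],  # still step 0
    [0.0, 0.0, 1.0],  # misleading frame that resembles step 2
    [0.1, 1.0, 0.0],  # step 1
    [0.0, 0.2, 1.0],  # step 2
])

sim = cosine_similarity(frames, steps)
naive = np.argmax(sim, axis=1).tolist()  # per-frame best match, no ordering
tracked = track_steps(sim)               # forward-only predictions
print(naive, tracked)
```

In this toy run, the per-frame argmax jumps to the last step on the misleading third frame, while the forward-only tracker stays on the first step until a frame matching the next step arrives. A real system would need real embeddings and a more careful temporal model than this greedy rule.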

Paper Structure

This paper contains 31 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration of OSCAR, application of progress tracking, and context-aware Q/A. (A) Recipe Formatting and Object Status Extraction, (B) Visual Data Extraction and Recipe Step Alignment, (C) Log of Similarity Metrics and Sequential Predictions, (D) Time-Causal Model, (E) Progress Tracking, (F) Context-aware Q/A.
  • Figure 2: Frames from a cooking video in which visually similar frames from different steps caused mispredictions.
  • Figure 3: Thumbnails of the non-visual cooking dataset of 12 videos recorded by people with vision impairments.
  • Figure 4: Three frames captured from V4 during step 1: 'Crack an egg and scramble it.' In the middle frame, the blind cook was looking for a garbage bin to discard the eggshell; this frame was captured and reduced the accuracy of step prediction.