Recipe Generation from Unsegmented Cooking Videos

Taichi Nishimura; Atsushi Hashimoto; Yoshitaka Ushiku; Hirotaka Kameko; Shinsuke Mori

TL;DR

This work addresses recipe generation from unsegmented cooking videos, which requires jointly extracting key cooking events and generating sentences grounded in them. It introduces a transformer-based multimodal recurrent model whose event selector and sentence generator memorize and mix their prediction histories to produce a story-aware sequence of steps grounded in visual content. An extended model adds a dot-product visual simulator and textual attention to incorporate ingredient state transitions and verbalization. Evaluated on YouCook2, the base model outperforms state-of-the-art dense video captioning (DVC) methods on story-oriented metrics, and the extended model further improves grounding and event sequencing. The approach offers a practical path to readable multimedia recipe summaries and could benefit education, content summarization, and cooking assistance systems, especially when narrated data is unavailable.

Abstract

This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract key events in completing the dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims at detecting events thoroughly and generating sentences for them. However, unlike DVC, in recipe generation, recipe story awareness is crucial, and a model should extract an appropriate number of events in the correct order and generate accurate sentences based on them. We analyze the output of the DVC model and confirm that although (1) several events are adoptable as a recipe story, (2) the generated sentences for such events are not grounded in the visual content. Based on this, we set our goal to obtain correct recipes by selecting oracle events from the output events and re-generating sentences for them. To achieve this, we propose a transformer-based multimodal recurrent approach of training an event selector and sentence generator for selecting oracle events from the DVC's events and generating sentences for them. In addition, we extend the model by including ingredients to generate more accurate recipes. The experimental results show that the proposed method outperforms state-of-the-art DVC models. We also confirm that, by modeling the recipe in a story-aware manner, the proposed model outputs the appropriate number of events in the correct order.
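As a rough illustration of the oracle-event idea described in the abstract, the sketch below assumes that an oracle event is the candidate event with the highest temporal intersection over union (tIoU) against each ground-truth event. The function names, the `(start, end)` segment format, and the toy timestamps are illustrative assumptions and are not taken from the paper's code.

```python
from typing import List, Tuple

# Assumed representation: an event is a (start_sec, end_sec) pair.
Segment = Tuple[float, float]


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union of two time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_oracle_events(
    candidates: List[Segment], ground_truth: List[Segment]
) -> List[int]:
    """For each ground-truth event, pick the index of the best-matching candidate.

    Hypothetical helper: assumes at least one candidate is available.
    """
    return [
        max(range(len(candidates)), key=lambda i: tiou(candidates[i], gt))
        for gt in ground_truth
    ]


# Toy data (made up for illustration): four candidate events, two annotated steps.
candidates = [(0.0, 10.0), (8.0, 20.0), (18.0, 35.0), (30.0, 50.0)]
ground_truth = [(0.0, 9.0), (20.0, 34.0)]
oracle = select_oracle_events(candidates, ground_truth)
print(oracle)  # → [0, 2]
```

Under this assumption, the selected indices follow the order of the ground-truth steps, which is the property the abstract refers to as extracting events "in the correct order".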

Paper Structure (30 sections, 11 equations, 11 figures, 10 tables)

Introduction
Related Work
Recipe generation from visual observations
Video captioning
Oracle-based Analysis of the Existing DVC Model
Quantitative evaluation
Qualitative evaluation
Proposed method
Event selector
Sentence generator
Multimodal memory mixing
Loss functions
Extended model
Dot-product visual simulator
Textual attention
...and 15 more sections

Figures (11)

Figure 1: A conceptual comparison of our approach and existing DVC studies. While the existing DVC models adopt parallel prediction, our approach employs multimodal recurrent prediction, which estimates events and sentences by memorizing and fusing the previous prediction results.
Figure 2: tIoU distribution of oracle events on the training and validation sets of YouCook2.
Figure 3: Comparison of the recipes generated by the oracle selection and ground truth. $N=100$, which is a default hyper-parameter of PDVC, is used in this example.
Figure 4: An introductory overview of our approach. Unlike the previous DVC approaches, we propose a multimodal recurrent learning approach to train the event selector and sentence generator. Both modules represent the previously predicted events and sentences as memory vectors and predict the next step. These memory vectors are updated and mixed to effectively share the previous predictions belonging to different modalities.
Figure 5: Multimodal recurrent learning approach of the event selector and sentence generator for recipe generation from unsegmented cooking videos. The event selector tries to choose oracle events from event candidates repeatedly (Section \ref{subsec:event_selector}) and the sentence generator outputs sentences for the selected events (Section \ref{subsec:sent_generator}). The memories are updated and mixed to effectively remember the history of the events/sentences for predicting the next step (Section \ref{subsec:memory_update}).
...and 6 more figures
