COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku

TL;DR

COM Kitchens provides the first overhead-view, unedited smartphone cooking videos with richly annotated visual action graphs, enabling two novel vision-language benchmarks: Online Recipe Retrieval (OnRR) and Dense Video Captioning on overhead-view videos (DVC-OV). By collecting 145 videos across 70 kitchens and annotating 6,826 bounding boxes with 8,061 relations, the work demonstrates a realistic, long-form procedural domain that challenges current web-video–trained models. OnRR reveals difficulties in online cross-modal retrieval for feasible recipe selection, while DVC-OV shows domain gaps for dense captioning and how action-graph supervision can improve alignment and caption quality. The dataset supports future research in procedural understanding, long-horizon reasoning, and domain-specific pretraining, with potential applications in smartphone-based queryable cooking guides and memory-augmented episodic reasoning for procedural tasks.

Abstract

Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data. Consequently, existing works often use web videos as training resources, making it challenging to query instructional contents from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants performed food preparation based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. With this dataset, we propose the novel video-to-text retrieval task Online Recipe Retrieval (OnRR) and new video captioning domain Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verified the capabilities and limitations of current web-video-based SOTA methods in handling these tasks.

Paper Structure (28 sections, 1 equation, 13 figures, 7 tables)

Figures (13)

  • Figure 1: Sample from COM Kitchens: the dataset includes unedited overhead-view cooking videos, each manually annotated with a visual action graph that links instructional texts to visual elements through edges from before ($\Box$) to after ($\Box$) bounding boxes (BBs). Destination BBs ($\Box$) represent mixing. Details are provided in \ref{ss:def}.
  • Figure 2: A partial view of our visual action graph: AP7 consists of two sub-APs (7-1 and 7-2). All bounding boxes (BBs) mark foods (e.g., the destination BB in AP7-1 is oil heated in AP6-1). The duration is defined by the first and last BBs of the sub-APs.
  • Figure 3: Distributions of duration; the averages are 16.6 min for videos and 46.7 sec for APs.
  • Figure 4: Distributions of word-sequence length; the averages are 87.2 words for recipes and 13.3 words for sentences.
  • Figure 6: Data related to OnRR: The query of OnRR sub-tasks is the first $Z$% of a video. For feasible recipe retrieval, we added an extra recipe resource to enhance the dataset of retrieval targets alongside our test set.
  • ...and 8 more figures
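
Two of the captions above describe simple time-span rules: an AP's duration is bounded by the first and last BBs of its sub-APs (Figure 2), and an OnRR query is the first $Z$% of a video (Figure 6). The following is a minimal sketch of those two rules only. The `BoundingBox` record, its `time_sec` field, and both function names are hypothetical and are not taken from the dataset's actual annotation format or tooling.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Hypothetical annotation record: one bounding box at a timestamp."""
    time_sec: float


def ap_duration(bbs: list[BoundingBox]) -> tuple[float, float]:
    """Return the (start, end) span covered by an AP.

    Assumes `bbs` holds every BB belonging to the AP's sub-APs, so the
    span runs from the earliest BB to the latest one.
    """
    if not bbs:
        raise ValueError("an AP needs at least one bounding box")
    times = [bb.time_sec for bb in bbs]
    return min(times), max(times)


def online_query_span(video_length_sec: float, z_percent: float) -> tuple[float, float]:
    """Return the (start, end) span covering the first Z% of a video."""
    if not 0 < z_percent <= 100:
        raise ValueError("z_percent must be in (0, 100]")
    return 0.0, video_length_sec * z_percent / 100.0


# Example with made-up timestamps (not dataset values).
span = ap_duration([BoundingBox(12.0), BoundingBox(30.5), BoundingBox(58.7)])
query = online_query_span(1000.0, 25)
```

Here `span` covers the AP from its first to its last annotated BB, and `query` is the prefix of the video that an online retrieval model would be allowed to see.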