Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato

TL;DR

This work tackles dense video captioning for egocentric videos, a domain hampered by data scarcity, by proposing cross-view transfer from exocentric web videos (YouCook2, YC2) to egocentric recordings (EgoYC2). The authors introduce the EgoYC2 dataset and a view-invariant learning framework in which a feature converter $F$ and a view classifier $C$ are trained adversarially through a gradient reversal layer: $C$ minimizes the adversarial loss $L_{adv}$, while $F$ is pushed toward view-invariant features and optimized for the task loss $L_{task}$. Adaptation is gradual, passing through an intermediate ego-like view, with view-invariant pre-training (VI-PT) on the source data followed by view-invariant fine-tuning (VI-FT) on both source and target data. Hand-object features, combined with hand tracking and segmentation, further improve robustness to egocentric motion, and a unified PDVC-based captioning network jointly localizes time segments and generates their descriptions. Quantitative and qualitative analyses on YC2 → EgoYC2 show that the view-invariant approach, particularly when augmented with hand-object cues, yields significant gains over naive transfer and standard domain adaptation methods, supporting the practicality of cross-view dense captioning. The benchmark and methodology pave the way for natural-language descriptions of real-world egocentric procedural activities, with potential impact on assistive technology, AR interfaces, and human-robot collaboration.
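The adversarial objective summarized above can be written out in the standard gradient-reversal form of adversarial domain adaptation. The following is a sketch under that assumption, not the paper's exact equations: the parameter symbols $\theta_F$, $\theta_C$ and the trade-off weight $\lambda$ are notation introduced here for illustration.

```latex
% Gradient reversal layer R_lambda: identity in the forward pass,
% sign-flipped (and scaled) gradient in the backward pass.
R_\lambda(x) = x, \qquad \frac{\partial R_\lambda}{\partial x} = -\lambda I

% The view classifier C learns to tell the views apart.
\hat{\theta}_C = \arg\min_{\theta_C} \; L_{adv}(\theta_F, \theta_C)

% The feature converter F solves the task while confusing C.
\hat{\theta}_F = \arg\min_{\theta_F} \; L_{task}(\theta_F) - \lambda \, L_{adv}(\theta_F, \theta_C)
```

Placing $R_\lambda$ between $F$ and $C$ lets a single backward pass realize both updates: $C$ receives the ordinary gradient of $L_{adv}$, while $F$ receives its negation.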

Abstract

We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is demanded as a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain shots showing either full-body or hand regions, while the egocentric view is constantly shifting. This necessitates the in-depth study of cross-view transfer under complex view changes. To this end, we first create a real-life egocentric dataset (EgoYC2) whose captions follow the definition of YouCook2 captions, enabling transfer learning between these datasets with access to their ground-truth. To bridge the view gaps, we propose a view-invariant learning method using adversarial training, which consists of pre-training and fine-tuning stages. Our experiments confirm the effectiveness of overcoming the view change problem and knowledge transfer to egocentric views. Our benchmark pushes the study of cross-view transfer into a new task domain of dense video captioning and envisions methodologies that describe egocentric videos in natural language.

Paper Structure (16 sections, 4 equations, 10 figures, 6 tables)

This paper contains 16 sections, 4 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Our cross-view knowledge transfer of dense video captioning. We propose to utilize existing web instructional videos with exocentric views, YouCook2 (YC2) [zhou:aaai18], to improve dense video captioning on newly recorded egocentric videos (EgoYC2). EgoYC2's captions are annotated following the YC2 caption definition, enabling the study of transfer learning under view gaps in videos.
  • Figure 2: View-invariant learning across exocentric and egocentric views. (i) We define an intermediate view (ego-like) in the source domain, which lies between the exo and ego views. We treat source images in which a face is detected as the exo view and the others as the ego-like view because of their similarity to the ego view. We generate video features using a fixed encoder $\phi$ and describe this processing for egocentric videos in Section \ref{sec:hofeat}. (ii) We design our view-invariant (VI) learning to gradually adapt from exo to ego views. Our method consists of pre-training (PT) on the source data and fine-tuning (FT) across the source and target data. Following adversarial domain adaptation [ganin:icml15], we train a feature converter $F$ and a view classifier $C$ with a gradient reversal layer (GRL). This encourages $F$ to learn features that are invariant to the view classes that $C$ tries to distinguish. PT takes the source data with the exo and ego-like classes, while FT takes all views to align them.
  • Figure 3: Baseline for egocentric dense video captioning. Our baseline consists of (i) hand-object encoding and (ii) one-stage captioning with parallel decoding (PDVC [wang:iccv21]). We first preprocess the egocentric videos with hand detection ("crop area") and hand-object segmentation ("hands", "1st obj.", and "2nd obj."). We extract features for these regions with the fixed encoder $\phi$ and pass their concatenated features to the feature converter $F$. The generated video features are fed to a transformer-based captioning model with two prediction heads, one for time segments and one for captions.
  • Figure 4: Time segmentation by detected AR markers. At each transition between cooking steps, we ask participants to check the next step on their smartphone or tablet and to display an AR marker once they have confirmed it. Given a recorded video, we postprocess it to detect the marker and segment the video temporally.
  • Figure 5: Qualitative results (recipe: scrambled eggs). We show generated captions given time segment proposals from prediction or ground-truth. We compare our ablation models: view-invariant (VI) pre-training (PT) and/or view-invariant (VI) fine-tuning (FT). The marks $\square$ and $\vartriangle$ indicate failure cases with irrelevant ingredients and duplicate captions, respectively.
  • ...and 5 more figures
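
The marker-based time segmentation described in Figure 4 can be illustrated with a short sketch. It assumes that an upstream detector has already produced one boolean per frame saying whether the AR marker is visible; the function name `segment_by_markers`, its parameters, and the rule "each run of marker-free frames is one step" are hypothetical choices for illustration, not the authors' postprocessing code.

```python
from typing import List, Tuple


def segment_by_markers(marker_visible: List[bool], fps: float) -> List[Tuple[float, float]]:
    """Split a video into (start_sec, end_sec) segments.

    Assumption: frames where the marker is visible are step transitions,
    so every maximal run of marker-free frames is treated as one step.
    """
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current marker-free run
    for i, visible in enumerate(marker_visible):
        if not visible and start is None:
            start = i  # a new step begins
        elif visible and start is not None:
            segments.append((start / fps, i / fps))  # marker closes the step
            start = None
    if start is not None:
        # The video ended while a step was still in progress.
        segments.append((start / fps, len(marker_visible) / fps))
    return segments


if __name__ == "__main__":
    # Toy per-frame detections at 1 frame per second (hypothetical data).
    flags = [False, False, True, True, False, False, False, True, False]
    print(segment_by_markers(flags, fps=1.0))
```

In practice the per-frame flags would likely need smoothing, since a marker that flickers out of detection for a few frames would otherwise create spurious short segments.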