Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato
TL;DR
This work tackles dense video captioning for egocentric videos, a setting limited by data scarcity, through cross-view transfer from exocentric web videos (YouCook2, YC2) to egocentric recordings (EgoYC2). The authors introduce the EgoYC2 dataset and a view-invariant learning framework built on a feature converter $F$ and a view classifier $C$ joined by a gradient reversal layer: $C$ is trained to minimize the adversarial loss $L_{adv}$, while the reversed gradient pushes $F$ toward features from which the view cannot be predicted, alongside the task loss $L_{task}$. Adaptation is gradual, passing through an intermediate ego-like view, and proceeds in two stages: pre-training (VI-PT) followed by fine-tuning (VI-FT) on both source and target data. Hand-object features, combined with hand tracking and segmentation, further improve robustness to egocentric motion, and a PDVC-based captioning network handles time-segment localization and caption generation jointly. Quantitative and qualitative analyses on YC2 → EgoYC2 indicate that the view-invariant approach, particularly when augmented with hand-object cues, improves over naive transfer and standard domain adaptation methods. The benchmark and method are a step toward natural-language descriptions of real-world egocentric procedural activities, with potential applications in assistive technology, AR interfaces, and human-robot collaboration.
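To make the gradient reversal idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the scalar weights `w_f` (standing in for $F$) and `w_c` (standing in for $C$), the binary cross-entropy form of $L_{adv}$, and the scale `lam` are all assumptions chosen for illustration. It only shows how a gradient reversal layer leaves the forward pass unchanged while flipping the gradient that reaches the feature converter.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def grl_forward(feature):
    """Gradient reversal layer, forward pass: identity."""
    return feature


def grl_backward(grad, lam):
    """Gradient reversal layer, backward pass: flip the sign, scale by lam."""
    return -lam * grad


def adversarial_grads(x, view_label, w_f, w_c, lam=1.0):
    """Return (L_adv, dL/dw_c, reversed dL/dw_f) for one toy scalar sample.

    Hypothetical toy model: feature = w_f * x, view logit = w_c * feature,
    view_label is 0 (exocentric) or 1 (egocentric).
    """
    feature = w_f * x                    # toy feature converter F
    logit = w_c * grl_forward(feature)   # toy view classifier C
    p = sigmoid(logit)
    # Binary cross-entropy on the view label (assumed form of L_adv).
    loss = -(view_label * math.log(p) + (1 - view_label) * math.log(1.0 - p))
    dlogit = p - view_label              # dL/dlogit for sigmoid + BCE
    grad_w_c = dlogit * feature          # classifier descends on L_adv
    grad_feature = grl_backward(dlogit * w_c, lam)
    grad_w_f = grad_feature * x          # converter receives the reversed gradient
    return loss, grad_w_c, grad_w_f


if __name__ == "__main__":
    loss, g_c, g_f = adversarial_grads(x=2.0, view_label=1, w_f=0.5, w_c=1.0)
    print(loss, g_c, g_f)
```

A single gradient-descent step with these values lowers $L_{adv}$ with respect to `w_c` but raises it with respect to `w_f`, which is the adversarial effect described above. In a full model the reversed gradient would be added to the gradient of $L_{task}$ before updating the converter.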
Abstract
We propose a novel benchmark for cross-view knowledge transfer in dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos remain limited due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes: the web videos contain shots showing either full-body or hand regions, while the egocentric view is constantly shifting. This necessitates an in-depth study of cross-view transfer under complex view changes. To this end, we first create a real-life egocentric dataset (EgoYC2) whose captions follow the definition of YouCook2 captions, enabling transfer learning between these datasets with access to their ground truth. To bridge the view gaps, we propose a view-invariant learning method using adversarial training, which consists of pre-training and fine-tuning stages. Our experiments confirm that the method effectively overcomes the view change problem and transfers knowledge to egocentric views. Our benchmark pushes the study of cross-view transfer into the new task domain of dense video captioning and envisions methodologies that describe egocentric videos in natural language.
