Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning

Zi-Yi Dou, Xitong Yang, Tushar Nagarajan, Huiyu Wang, Jing Huang, Nanyun Peng, Kris Kitani, Fu-Jen Chu

TL;DR

EMBED addresses the underutilization of exocentric video-language data for egocentric representation learning by converting exocentric data into egocentric-style data through curation focused on hand-object interaction (HOI) and narration-style transfer. It generates narrations along two paths (an exo-to-ego rephraser and an ego narrator) and combines the curated exocentric data with Ego4D for joint vision-language pretraining with a contrastive InfoNCE objective. Empirical results show state-of-the-art zero-shot performance on EK-100 MIR and EGTEA, strong results on EgoMCQ, NLQ, and MQ, and robust generalization across multiple exocentric datasets. The work demonstrates the practical value of cross-view data transformation for scalable, generalizable video-language pretraining beyond egocentric-only data.

Abstract

We present EMBED (Egocentric Models Built with Exocentric Data), a method designed to transform exocentric video-language data for egocentric video representation learning. Large-scale exocentric data covers diverse activities with significant potential for egocentric learning, but inherent disparities between egocentric and exocentric data pose challenges in utilizing one view for the other seamlessly. Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities. Additionally, narratives in egocentric datasets are typically more action-centric and closely linked with the visual content, in contrast to the narrative styles found in exocentric datasets. To address these challenges, we employ a data transformation framework to adapt exocentric data for egocentric training, focusing on identifying specific video clips that emphasize hand-object interactions and transforming narration styles to align with egocentric perspectives. By applying both vision and language style transfer, our framework creates a new egocentric dataset derived from exocentric video-language data. Through extensive evaluations, we demonstrate the effectiveness of EMBED, achieving state-of-the-art results across various egocentric downstream tasks, including an absolute improvement of 4.7% on the Epic-Kitchens-100 multi-instance retrieval and 6.2% on the EGTEA classification benchmarks in zero-shot settings. Furthermore, EMBED enables egocentric video-language models to perform competitively in exocentric tasks. Finally, we showcase EMBED's application across various exocentric datasets, exhibiting strong generalization capabilities when applied to different exocentric datasets.
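
The TL;DR mentions a contrastive InfoNCE objective for joint vision-language pretraining. As a rough illustration only, the sketch below computes an InfoNCE loss over a batch of paired video and text embeddings in plain Python. The symmetric (video-to-text plus text-to-video) form, the `temperature` value of 0.07, and all function names are assumptions made for this example; the summary does not specify them.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def mean_diagonal_cross_entropy(logits):
    """Average cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_denominator = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_denominator - row[i]
    return total / len(logits)


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (assumed form, hypothetical names).

    video_embs[i] and text_embs[i] are treated as the positive pair;
    every other combination in the batch acts as a negative.
    """
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    # Cosine similarities scaled by the temperature: rows = videos, cols = texts.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    video_to_text = mean_diagonal_cross_entropy(logits)
    text_to_video = mean_diagonal_cross_entropy(logits_transposed)
    return 0.5 * (video_to_text + text_to_video)


if __name__ == "__main__":
    basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    shuffled = [basis[1], basis[2], basis[0]]
    print("aligned pairs:   ", info_nce(basis, basis))
    print("misaligned pairs:", info_nce(basis, shuffled))
```

With correctly paired embeddings the loss is close to zero, while pairing each video with the wrong text produces a large loss, which is the behavior a contrastive objective relies on.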

Paper Structure (57 sections, 2 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Despite the domain difference, exocentric data can contain egocentric cues such as hand-object interaction information in vision and language modalities. Our EMBED method extracts and leverages these cues, transforming exocentric video-language data for egocentric representation learning.
  • Figure 2: Given an exocentric dataset, EMBED selects video clips featuring hand-object interactions (HOI) and further refines these selections by focusing on HOI regions to offer a close-up view. Additionally, we pair each exocentric clip with narrations emphasizing human actions, akin to those in egocentric data. This is achieved with a narrator model trained on egocentric data and an exo-to-ego rephraser model that converts existing sentences into action-oriented narrations that reflect an egocentric perspective.
  • Figure 3: The HOI detector can accurately extract the right hand (R-P), left hand (L-P) and object (O) regions from a video frame.
  • Figure 4: Video clips with high and low HOI scores. Videos with high HOI scores typically contain close-up hand-object interactions whereas videos with low HOI scores do not capture any human actions.
  • Figure 5: Demonstration of HOI region spatial focus. Given a video clip, we extract the hand (in red and blue) and object (in orange) regions from each frame. We then compute the convex hull of all the boxes (in green) and crop the regions.
  • ...and 1 more figure
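
The captions for Figures 4 and 5 describe two curation steps: keeping clips with high HOI scores, and cropping each clip to the region spanned by the detected hand and object boxes. The sketch below illustrates both under stated assumptions. The score threshold of 0.5, the `(x1, y1, x2, y2)` box format, the frame-as-nested-list representation, and all function names are hypothetical. The caption speaks of a convex hull; this example approximates it with the smallest axis-aligned rectangle that encloses every box, since a rectangular crop is what a video model ultimately consumes.

```python
def select_hoi_clips(clip_scores, threshold=0.5):
    """Keep clip ids whose HOI score reaches the (hypothetical) threshold."""
    return [clip_id for clip_id, score in clip_scores.items() if score >= threshold]


def enclosing_box(boxes):
    """Smallest axis-aligned box (x1, y1, x2, y2) that covers every input box.

    `boxes` holds the hand and object boxes collected from all frames of a clip.
    """
    if not boxes:
        raise ValueError("at least one box is required")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def crop_frame(frame, box):
    """Crop a frame stored as a list of pixel rows to the given box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]


if __name__ == "__main__":
    # Illustrative scores only; not values from the paper.
    scores = {"clip_a": 0.9, "clip_b": 0.1, "clip_c": 0.6}
    print(select_hoi_clips(scores))  # ['clip_a', 'clip_c']

    # Left-hand, right-hand, and object boxes gathered across frames.
    boxes = [(2, 3, 5, 6), (4, 1, 8, 4), (3, 2, 6, 7)]
    region = enclosing_box(boxes)
    print(region)  # (2, 1, 8, 7)

    # A dummy 10x10 frame whose pixels record their own (row, col) position.
    frame = [[(r, c) for c in range(10)] for r in range(10)]
    cropped = crop_frame(frame, region)
    print(len(cropped), len(cropped[0]))  # 6 6
```

Using one crop region per clip, rather than one per frame, keeps the view stable across frames; whether the original pipeline does exactly this is not stated in the summary.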