Retrieval-Augmented Egocentric Video Captioning

Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie

TL;DR

EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to improve the captioning of egocentric videos. Its cross-view retrieval module demonstrates superior performance across seven benchmarks.

Abstract

Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references. Project page is available at: https://jazzcharles.github.io/Egoinstructor/
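
To make point (3) concrete, here is a minimal sketch of a contrastive loss in the spirit of EgoExoNCE. It is not the paper's exact formulation: the abstract only states that ego and exo video features are pulled together by aligning them to shared text features describing similar actions. The function name `multi_positive_nce`, the use of plain action labels to decide which texts are positives, and the temperature value are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_positive_nce(video_feats, text_feats, video_actions, text_actions,
                       temperature=0.07):
    """Video-to-text contrastive loss with several positives per video.

    Assumption: a text is a positive for a video when their action labels
    match, regardless of whether the video is egocentric or exocentric.
    Sharing the same text anchors is what pulls the two views together.
    """
    videos = [_normalize(v) for v in video_feats]
    texts = [_normalize(t) for t in text_feats]
    losses = []
    for video, action in zip(videos, video_actions):
        logits = [_dot(video, t) / temperature for t in texts]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        positive_mass = sum(e for e, a in zip(exps, text_actions) if a == action)
        if positive_mass == 0.0:
            continue  # this video has no matching text in the batch
        losses.append(-math.log(positive_mass / sum(exps)))
    return sum(losses) / len(losses) if losses else 0.0


# Two views of "toast bread" (one ego, one exo) and one "cut onion" clip.
video_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
video_actions = ["toast bread", "toast bread", "cut onion"]
text_actions = ["toast bread", "cut onion"]

aligned = multi_positive_nce(video_feats, [[1.0, 0.0], [0.0, 1.0]],
                             video_actions, text_actions)
misaligned = multi_positive_nce(video_feats, [[0.0, 1.0], [1.0, 0.0]],
                                video_actions, text_actions)
```

In this toy batch the loss is small when each video sits near the text for its own action and large when the text features are swapped, which is the behaviour a loss of this family is meant to reward.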

Paper Structure (15 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: EgoInstructor is a retrieval-augmented multimodal captioning model that retrieves relevant exocentric videos as references to generate captions for egocentric videos. The cross-view retrieval ability is enabled by training on automatically constructed, large-scale, pseudo-paired ego-exo videos.
  • Figure 2: An overview of our EgoInstructor. Given an egocentric video, we first retrieve relevant exocentric instructional videos using a frozen cross-view retrieval module pre-trained on automatically generated pseudo ego-exo pairs. The multimodal captioning model (consisting of a visual encoder, a perceiver resampler, and a text decoder) takes the egocentric video, together with the retrieved videos and their captions as references, and generates the caption of the ego-video.
  • Figure 3: Our cross-view retrieval module trained via EgoExoNCE loss. We keep the egocentric and exocentric video encoders frozen and train the cross-view video encoder and text encoder.
  • Figure 4: An illustration of context-aware caption refinement (left) and cross-view pair construction (right). The ASR transcripts of instructional videos are concatenated and refined by an LLM to match the descriptive style of manually labelled captions in Ego4D. We construct the ego-exo pairs by choosing ego and exo captions that describe a similar action (e.g., toast the bread).
  • Figure 5: Visualisation results. For each egocentric video, we show two retrieved third-person instructional videos and their original ASR/refined captions. By leveraging retrieved exocentric samples, the generated captions capture correct actions and interacting objects.
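
The cross-view pair construction described for Figure 4 can be sketched roughly as follows. This is an illustrative assumption, not the authors' pipeline: it presumes that each clip already carries a normalised action key (such as "toast bread") extracted from its caption by some earlier, unspecified step, and the function name `build_ego_exo_pairs` and the clip identifiers are hypothetical.

```python
from collections import defaultdict


def build_ego_exo_pairs(ego_clips, exo_clips):
    """Pair every ego clip with every exo clip that shares its action key.

    Each clip is a (clip_id, action) tuple. The action key is assumed to be
    pre-extracted from the caption; how that is done is not covered here.
    """
    exo_by_action = defaultdict(list)
    for clip_id, action in exo_clips:
        exo_by_action[action].append(clip_id)

    pairs = []
    for ego_id, action in ego_clips:
        # .get avoids creating empty entries for actions with no exo match
        for exo_id in exo_by_action.get(action, []):
            pairs.append((ego_id, exo_id, action))
    return pairs


ego_clips = [("ego_1", "toast bread"), ("ego_2", "cut onion")]
exo_clips = [("exo_a", "toast bread"), ("exo_b", "toast bread"),
             ("exo_c", "wash dishes")]
pairs = build_ego_exo_pairs(ego_clips, exo_clips)
```

Here `ego_1` is paired with both exocentric "toast bread" clips, while `ego_2` stays unpaired because no exocentric clip shares its action. A real pipeline would need a softer notion of "similar action" than exact key equality.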