Retrieval-Augmented Egocentric Video Captioning

Jilan Xu; Yifei Huang; Junlin Hou; Guo Chen; Yuejie Zhang; Rui Feng; Weidi Xie

TL;DR

EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to improve the captioning of egocentric videos. Its cross-view retrieval module demonstrates superior performance across seven benchmarks.

Abstract

Understanding human actions from first-person-view videos poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references. The project page is available at: https://jazzcharles.github.io/Egoinstructor/
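The idea in point (3), aligning egocentric and exocentric video features to shared text features, can be illustrated with a contrastive loss that allows several positive captions per clip. The following is only a minimal sketch under stated assumptions, not the paper's EgoExoNCE formulation: the function name `multi_positive_nce`, the temperature value, and the rule that fills `pos_mask` are all hypothetical, and only the video-to-text direction is shown.

```python
import numpy as np

def multi_positive_nce(video_feats, text_feats, pos_mask, temperature=0.05):
    """Video-to-text InfoNCE with several positives per clip (illustrative).

    video_feats: (N, D) features of ego and exo clips mixed in one batch.
    text_feats:  (N, D) features of the clips' captions.
    pos_mask:    (N, N) boolean; pos_mask[i, j] is True when caption j is
                 assumed to describe an action similar to clip i. Each row
                 needs at least one True entry (normally the diagonal).
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    logits = v @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)

    pos = (exp * pos_mask).sum(axis=1)   # mass on all positive captions
    total = exp.sum(axis=1)              # mass on every caption in the batch
    return float(-np.log(pos / total).mean())
```

Because an ego clip and an exo clip that share a positive caption are both pulled toward the same text feature, they are also pulled toward each other, which is the intuition the abstract describes.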

Paper Structure (15 sections, 5 equations, 5 figures, 5 tables)

Introduction
Related Work
Methodology
Cross-view Visual Representation Alignment
Architecture Detail
Automatic Ego-Exo Pair Generation
Retrieval Module Training and Inference
Retrieval-augmented Captioning
Experiments
Experimental Setups
Experimental Results
Results on Cross-view Retrieval
Results on Retrieval-augmented Captioning
Qualitative Results
Conclusion

Figures (5)

Figure 1: EgoInstructor is a retrieval-augmented multimodal captioning model that retrieves relevant exocentric videos as references to generate the caption of egocentric videos. The cross-view retrieval ability is enabled by training on automatically constructed large-scale pseudo paired ego-exo videos.
Figure 2: An overview of our EgoInstructor. Given an egocentric video, we first retrieve relevant exocentric instructional videos using a frozen cross-view retrieval module pre-trained on automatically generated pseudo ego-exo pairs. The multimodal captioning model (consisting of a visual encoder, a perceiver resampler, and a text decoder) takes the egocentric video together with the retrieved videos and their captions as references, and generates the caption of the ego-video.
Figure 3: Our cross-view retrieval module trained via EgoExoNCE loss. We keep the egocentric and exocentric video encoders frozen and train the cross-view video encoder and text encoder.
Figure 4: An illustration of context-aware caption refinement (left) and cross-view pair construction (right). The ASR transcripts of instructional videos are concatenated and refined by an LLM to match the descriptive style of manually labelled captions in Ego4D. We construct the ego-exo pairs by choosing ego and exo captions that describe a similar action (e.g., toast the bread).
Figure 5: Visualisation results. For each egocentric video, we show two retrieved third-person instructional videos and their original ASR/refined captions. By leveraging retrieved exocentric samples, the generated captions capture correct actions and interacting objects.
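
The retrieval step described in the Figure 2 and Figure 5 captions, where an egocentric query is matched to a few exocentric reference videos, can be sketched as a nearest-neighbour search over precomputed embeddings. This is an illustrative sketch only: it assumes cosine similarity as the ranking score, and the function name `retrieve_top_k`, the toy embeddings, and the choice of `k=2` are hypothetical rather than taken from the paper.

```python
import numpy as np

def retrieve_top_k(query_feat, bank_feats, k=2):
    """Return indices and cosine scores of the k most similar bank entries.

    query_feat: (D,) embedding of one egocentric clip.
    bank_feats: (M, D) embeddings of candidate exocentric clips.
    """
    q = query_feat / np.linalg.norm(query_feat)
    b = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    sims = b @ q                      # cosine similarity to every candidate
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order.tolist(), sims[order].tolist()

# Toy example with 2-D embeddings (made-up values).
bank = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
indices, scores = retrieve_top_k(query, bank, k=2)
```

In a captioning pipeline of this kind, the retrieved clips and their captions would then be passed to the captioning model as references alongside the egocentric clip.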
