EAGLE: Egocentric AGgregated Language-video Engine

Jing Bi, Yunlong Tang, Luchuan Song, Ali Vosoughi, Nguyen Nguyen, Chenliang Xu

TL;DR

This work introduces the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset, which together provide a unified framework integrating various egocentric video understanding tasks. It also proposes a set of evaluation metrics for thoroughly assessing multimodal large language models (MLLMs) on egocentric video understanding.

Abstract

The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, fragmentation across tasks such as action recognition, procedure learning, and moment retrieval, coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In response, we introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. EAGLE-400K, the first large-scale instruction-tuning dataset tailored for egocentric video, features 400K diverse samples to enhance a broad spectrum of tasks, from activity recognition to procedure knowledge learning. Moreover, EAGLE, a strong video multimodal large language model (MLLM), is designed to effectively capture both spatial and temporal information. In addition, we propose a set of evaluation metrics designed to facilitate a thorough assessment of MLLMs for egocentric video understanding. Our extensive experiments demonstrate EAGLE's superior performance over existing models, highlighting its ability to balance task-specific understanding with holistic video interpretation. With EAGLE, we aim to pave the way for research opportunities and practical applications in real-world scenarios.

Paper Structure

This paper contains 16 sections, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: An illustration of EAGLE, a framework designed to unify egocentric video tasks, thereby enhancing both inter-task and intra-task understanding.
  • Figure 2: Evaluation results of our EAGLE model and existing methods, including BLIP-2 [li2023blip], BLIP-1 [li2022blip], and InstructBLIP [dai2023instructblip], among others, using the newly proposed metrics on the EAGLE-400K benchmark.
  • Figure 3: Left: Representative frames from the Ego4D [grauman2022ego4d], EPIC-KITCHENS [Damen2022RESCALING], and PTA datasets, showcasing the detailed capture of task-oriented activities. Right: Visualizations of trajectories and object interactions within the EAGLE-400K dataset, emphasizing the tasks' complexity and diversity.
  • Figure 4: The architecture of the EAGLE model, which includes a fine-tuned projection layer and adapter that enhance the language model's capability to handle complex instructions containing temporal boundaries and object location tokens. This design enables the model to determine the temporal boundaries of events and identify the locations of objects within a given context.