AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding

Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon

TL;DR

AlanaVLM, a 7B-parameter VLM trained with parameter-efficient methods on EVUD, achieves state-of-the-art performance on OpenEQA, outperforming open-source models, including strong Socratic models that use GPT-4 as a planner, by 3.6%.

Abstract

AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI.

Paper Structure

This paper contains 36 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Egocentric Video Understanding Dataset (EVUD): a collection of egocentric video caption generation and video question-answering tasks that can be used for instruction-tuning video-based VLMs.
  • Figure 2: EVUD is built ensuring that the majority of examples focus on visual question answering (Ego4D VQA, Ego4D VQA Gemini and VSR), as well as image captioning (HM3D and EgoClip).
  • Figure 3: Human error analysis performed on 98 QA pairs on OpenEQA.
  • Figure 4: Length distribution of Ego4D NLQ clips.
  • Figure 5: Results of human evaluation on 1,400 examples. The percentages of appropriate questions, appropriate categories, and correct answers are shown on a per-category basis. Text labels show the percentage of questions/categories/answers in each category found to be appropriate and/or correct.
  • ...and 6 more figures