PARSE-Ego4D: Personal Action Recommendation Suggestions for Egocentric Videos

Steven Abreu, Tiffany D. Do, Karan Ahuja, Eric J. Gonzalez, Lee Payne, Daniel McDuff, Mar Gonzalez-Franco

TL;DR

PARSE-Ego4D addresses the lack of proactive action recommendations in egocentric video data by introducing a two-stage annotation pipeline that combines LLM-generated, context-aware suggestions with large-scale human grounding. Built on Ego4D, the dataset yields 18,360 candidate recommendations, each recorded with its context, a (query, action) pair, LLM provenance, and a rationale, and supports two benchmarks: Explicit Query-to-Action and Implicit Query-to-Action. Human evaluation shows substantial inter-rater agreement and favorable ratings (65% above 3, 42% above 4), while baseline LLMs show promise for on-device AR/VR assistance. This resource supports work on proactive, personalized egocentric assistants, with attention to efficiency, UI design, and potential societal impacts.

Abstract

Intelligent assistance involves not only understanding but also action. Existing egocentric video datasets contain rich annotations of the videos, but not of actions that an intelligent assistant could perform in the moment. To address this gap, we release PARSE-Ego4D, a new set of personal action recommendation annotations for the Ego4D dataset. We take a multi-stage approach to generating and evaluating these annotations. First, we used a prompt-engineered large language model (LLM) to generate context-aware action suggestions, yielding over 18,000 suggestions. While these synthetic action suggestions are valuable, the inherent limitations of LLMs necessitate human evaluation. Second, to ensure high-quality and user-centered recommendations, we conducted a large-scale human annotation study that provides grounding in human preferences for all of PARSE-Ego4D. We analyze the inter-rater agreement and evaluate the subjective preferences of participants. Based on our synthetic dataset and complete human annotations, we propose several new tasks for action suggestions based on egocentric videos. We encourage novel solutions that reduce latency and energy requirements. The annotations in PARSE-Ego4D will support researchers and developers who are building action recommendation systems for augmented and virtual reality systems.

Paper Structure (26 sections, 6 figures, 3 tables)

This paper contains 26 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Examples of action suggestions for different videos in the PARSE-Ego4D dataset.
  • Figure 2: PARSE-Ego4D - We curated, annotated, and open-sourced over 11,000 action suggestions for the Ego4D dataset. These annotations support researchers and developers who are working on building personalized action recommendation systems for augmented and virtual reality systems.
  • Figure 3: Left: Suggested actions by type. Right: Score distribution for different questions in the human annotation study, showing that there are more valid explicit suggestions than implicit suggestions.
  • Figure 4: Sketch of the survey that participants filled out in the human annotation study in order to verify the synthetically generated action suggestions in PARSE-Ego4D.
  • Figure 5: A demographic breakdown of our participants in the annotation study, including ethnicity, gender, and age. Countries with fewer than 15 participants are listed in "Other".
  • ...and 1 more figure