PARSE-Ego4D: Personal Action Recommendation Suggestions for Egocentric Videos
Steven Abreu, Tiffany D. Do, Karan Ahuja, Eric J. Gonzalez, Lee Payne, Daniel McDuff, Mar Gonzalez-Franco
TL;DR
PARSE-Ego4D addresses the lack of proactive action recommendations in egocentric video data by introducing a two-stage annotation pipeline that combines LLM-generated, context-aware suggestions with large-scale human grounding. Built on Ego4D, the dataset yields 18,360 candidate recommendations, each annotated with its context, (query, action) pair, LLM provenance, and rationale, and supports two benchmarks: Explicit Query-to-Action and Implicit Query-to-Action. Human evaluation shows substantial inter-rater agreement and favorable ratings (65% above 3, 42% above 4), while baseline LLMs show promise for on-device AR/VR assistance. This resource paves the way for proactive, personalized egocentric assistants, with attention to efficiency, UI design, and potential societal impacts.
Abstract
Intelligent assistance involves not only understanding but also action. Existing egocentric video datasets contain rich annotations of the videos, but not of the actions that an intelligent assistant could perform in the moment. To address this gap, we release PARSE-Ego4D, a new set of personal action recommendation annotations for the Ego4D dataset. We take a multi-stage approach to generating and evaluating these annotations. First, we use a prompt-engineered large language model (LLM) to generate over 18,000 context-aware action suggestions. While these synthetic action suggestions are valuable, the inherent limitations of LLMs necessitate human evaluation. To ensure high-quality and user-centered recommendations, we conduct a large-scale human annotation study that grounds all of PARSE-Ego4D in human preferences. We analyze the inter-rater agreement and evaluate the subjective preferences of participants. Based on our synthetic dataset and complete human annotations, we propose several new tasks for action suggestions based on egocentric videos. We encourage novel solutions that improve latency and energy requirements. The annotations in PARSE-Ego4D will support researchers and developers who are building action recommendation systems for augmented and virtual reality.
