Learning to Generate Long-term Future Narrations Describing Activities of Daily Living

Ramanathan Rajendiran, Debaditya Roy, Basura Fernando

TL;DR

This work introduces long-term future narration generation for daily activities, formalized as generating a sequence of future narrations conditioned on observed egocentric video frames and narrations. It presents ViNa, a visual–language model that uses a frozen CLIP-based frame encoder, learned video tokens obtained via cross-attention, and a GPT-2–style decoder to autoregressively generate 20+ future narrations. The approach is evaluated on Ego4D with a new large-scale benchmark and standard captioning metrics, and a downstream future video retrieval task demonstrates practical utility for planning and decision support. Compared with strong baselines, ViNa, particularly with a T5 or GPT-2 decoder, achieves superior or competitive performance, illustrating both the feasibility and the challenges of open-ended long-horizon narration forecasting for daily living activities.

Abstract

Anticipating future events is crucial for application domains such as healthcare, smart home technology, and surveillance. Narrative event descriptions provide context-rich information, enhancing a system's future planning and decision-making capabilities. We propose a novel task, long-term future narration generation, which extends beyond traditional action anticipation by generating detailed narrations of future daily activities. We introduce a visual-language model, ViNa, specifically designed to address this challenging task. ViNa integrates long-term videos and their corresponding narrations to generate a sequence of future narrations that predict subsequent events and actions over extended time horizons. Whereas existing multimodal models perform only short-term predictions or describe observed videos, ViNa generates long-term future narrations for a broader range of daily activities. We also present a novel downstream application, future video retrieval, which uses the generated narrations to help users plan a task by visualizing the future. We evaluate future narration generation on Ego4D, the largest egocentric dataset.

Paper Structure

This paper contains 22 sections, 1 equation, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Narrations offer a detailed account of events, capturing multiple interactions between individuals and objects. Unlike verb-noun depictions of actions, narrations deliver a detailed perspective of an event within a video. By focusing on generating future narratives instead of mere actions, we gain a richer understanding of what's to come. In this context, C represents the person wearing the camera.
  • Figure 2: Proposed model ViNa for video-conditioned future narration generation. A frozen CLIP image encoder is used to obtain a frame representation of the observed video frames. Learned queries attend to the frame representation to obtain learned video tokens that summarize the events in the video. A decoder such as GPT-2 takes the learned video tokens and observed narration language tokens to generate 20+ future narrations.
  • Figure 3: (Left): Effect of number of learned video tokens on different generation metrics. Performance peaks at 48 learned video tokens. (Right): Effect of number of decoded tokens on different generation metrics. Performance peaks around 200 tokens.
  • Figure 4: ViNa-GPT2 maintains continuity of observed narrations and matches the future narrations better than Vid2Seq. Green shows matched narrations with ground truth.
  • Figure 5: Future narrations generated by ViNa-GPT2 are consistent with the visual context provided by the observed video. Green shows generated narrations that match with GT. Blue indicates alternate future narrations not present in GT but plausible given the observed video.
  • ...and 6 more figures
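
The Figure 2 caption describes learned queries that attend to per-frame features to produce a fixed number of learned video tokens. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. It assumes single-head attention with no query, key, or value projections, and the names `cross_attend`, `DIM`, `NUM_FRAMES`, and `NUM_QUERIES` as well as the toy sizes are hypothetical. Random vectors stand in for the frozen CLIP frame features and for the learned queries.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Single-head cross-attention without projections (a simplification).

    queries: list of query vectors (learned parameters in a real model)
    frames:  list of frame feature vectors (e.g. from a frozen image encoder)
    Returns one output vector ("video token") per query.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for q in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in frames]
        weights = softmax(scores)
        # Each video token is a weighted average of the frame features.
        token = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        tokens.append(token)
    return tokens


random.seed(0)
DIM, NUM_FRAMES, NUM_QUERIES = 8, 16, 4  # toy sizes, not the paper's settings

# Stand-ins: random frame features and random "learned" queries.
frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

video_tokens = cross_attend(queries, frames)
print(len(video_tokens), len(video_tokens[0]))
```

In this sketch the number of video tokens depends only on the number of queries, not on the number of observed frames, so a decoder would receive a fixed-size visual prefix regardless of video length. That is one plausible reading of why the token count can be tuned independently, as in Figure 3.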