Learning to Generate Long-term Future Narrations Describing Activities of Daily Living
Ramanathan Rajendiran, Debaditya Roy, Basura Fernando
TL;DR
This work introduces long-term future narration generation for daily activities, formalized as generating a sequence of future narrations conditioned on observed egocentric video frames and their narrations. It presents ViNa, a visual–language model that combines a frozen CLIP-based frame encoder, video tokens learned via cross-attention, and a GPT-2–style decoder to autoregressively generate 20+ future narrations. The approach is evaluated on Ego4D using a new large-scale benchmark and standard captioning metrics, and a downstream future video retrieval task demonstrates its practical utility for planning and decision support. Compared with strong baselines, ViNa, particularly with a T5 or GPT-2 decoder, achieves superior or competitive performance, illustrating both the feasibility and the challenges of open-ended, long-horizon narration forecasting for daily living activities.
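The pipeline summarized above (frame features, cross-attention into a small set of video tokens, then autoregressive generation) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `cross_attend`, `generate`, and `toy_decoder`, the feature dimension, and the token counts are all hypothetical. Random vectors stand in for frozen CLIP features and for the learned queries, the attention is single-head with no learned projections, and the decoder is a stub that works at the level of whole narrations rather than word tokens.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Pool frame features into one video token per query vector.

    Single-head scaled dot-product attention without learned projections
    (a simplification assumed here for illustration).
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for q in queries:
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in frames]
        weights = softmax(scores)
        # Each video token is an attention-weighted average of the frame features.
        tokens.append([sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)])
    return tokens


def generate(video_tokens, observed_narrations, next_narration, num_future=20):
    """Autoregressive loop: every new narration is conditioned on the video
    tokens and on all narrations observed or generated so far."""
    context = list(observed_narrations)
    future = []
    for _ in range(num_future):
        narration = next_narration(video_tokens, context)
        future.append(narration)
        context.append(narration)
    return future


def toy_decoder(video_tokens, context):
    """Hypothetical stand-in for a language-model decoder."""
    return f"narration {len(context) + 1}"


random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # stand-in for frozen encoder features
queries = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # stand-in for learned query vectors

video_tokens = cross_attend(queries, frames)
future = generate(video_tokens, ["narration 1", "narration 2"], toy_decoder, num_future=20)
```

The point of the sketch is the data flow: the number of video tokens is fixed by the number of queries regardless of how many frames are observed, and each generated narration is appended to the context before the next one is produced.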
Abstract
Anticipating future events is crucial in application domains such as healthcare, smart home technology, and surveillance. Narrative event descriptions provide context-rich information that enhances a system's future planning and decision-making capabilities. We propose a novel task, $\textit{long-term future narration generation}$, which extends beyond traditional action anticipation by generating detailed narrations of future daily activities. We introduce a visual-language model, ViNa, specifically designed to address this challenging task. ViNa integrates long-term videos and their corresponding narrations to generate a sequence of future narrations that predict subsequent events and actions over extended time horizons. Unlike existing multimodal models, which only make short-term predictions or describe observed videos, ViNa generates long-term future narrations for a broader range of daily activities. We also present a novel downstream application, future video retrieval, which uses the generated narrations to help users plan a task better by visualizing the future. We evaluate future narration generation on Ego4D, the largest egocentric dataset.
