EgoLCD: Egocentric Video Generation with Long Context Diffusion

Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao, Wanke Xia, Bowen Zeng, Zeyu Zhang, Hao Tang

TL;DR

EgoLCD tackles long-context egocentric video generation by reframing generation as a memory-management problem. It introduces a dual-memory system: a Long-Term Sparse KV Cache to preserve global coherence and an attention-based short-term memory augmented with LoRA for rapid local adaptation, guided by a memory regulation loss to align with retrieved history. Structured Narrative Prompting provides temporally ordered guidance, while a Semi-AR diffusion framework enables scalable block-wise generation. Evaluations on EgoVid-5M and related data show superior temporal stability and perceptual quality, marking a significant step toward scalable world models for embodied AI.

Abstract

Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
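The dual-memory idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the class `DualMemory`, the function `generate_blocks`, the window size, the long-term budget, and the per-block importance scores are all hypothetical, chosen only to show how a dense short-term window and a sparse long-term store might jointly form the context for block-wise (semi-autoregressive) generation. A real system would cache attention keys and values rather than block ids, and would run a diffusion denoiser at each step.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DualMemory:
    """Toy two-tier memory: a dense recent window plus a sparse long-term store."""
    short_window: int = 2   # assumed: number of recent blocks kept densely
    long_budget: int = 3    # assumed: max entries kept in the sparse store
    short: deque = field(default_factory=deque)
    long: list = field(default_factory=list)  # (score, block_id) pairs

    def add(self, block_id: int, score: float) -> None:
        # New blocks always enter short-term memory first.
        self.short.append((score, block_id))
        # Blocks that fall out of the window move to the long-term store.
        while len(self.short) > self.short_window:
            self.long.append(self.short.popleft())
        # Sparsify: keep only the highest-scoring evicted blocks.
        self.long.sort(reverse=True)
        del self.long[self.long_budget:]

    def context(self) -> list:
        # Retained long-term ids in temporal order, followed by recent ids.
        long_ids = sorted(b for _, b in self.long)
        short_ids = [b for _, b in self.short]
        return long_ids + short_ids


def generate_blocks(scores):
    """Block-wise loop: each block is conditioned on the current memory context."""
    memory = DualMemory()
    contexts = []
    for block_id, score in enumerate(scores):
        contexts.append(memory.context())  # what this block would attend to
        memory.add(block_id, score)        # a real model would denoise the block here
    return contexts, memory


if __name__ == "__main__":
    # Hypothetical importance scores, one per generated block.
    contexts, memory = generate_blocks([0.9, 0.1, 0.5, 0.2, 0.8, 0.3, 0.7])
    for i, ctx in enumerate(contexts):
        print(f"block {i} attends to {ctx}")
    print("final context:", memory.context())
```

In this sketch the context grows until the long-term budget is reached, after which low-scoring old blocks are dropped while high-scoring ones persist, so the context size stays bounded regardless of video length.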

Paper Structure

This paper contains 33 sections, 12 equations, 3 figures, 3 tables, 3 algorithms.

Figures (3)

  • Figure 1: The proposed EgoLCD generates long-form egocentric videos that maintain coherent scene transitions and consistent object layouts.
  • Figure 2: Long-Short Memory & Structured Narrative Prompting.
  • Figure 3: The overall framework of EgoLCD.