EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing

Runjia Li, Moayed Haji-Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siarohin, Tomas Jakab, Junlin Han, Sergey Tulyakov, Philip Torr, Willi Menapace

TL;DR

The paper tackles the gap in egocentric, instruction-guided video editing for AR by introducing EgoEditData (a large, hand-preserving egocentric editing dataset), EgoEdit (a real-time video editor trained with flow matching and distillation for low-latency streaming inference), and EgoEditBench (a standardized egocentric-editing benchmark). It demonstrates that a data- and distillation-driven approach yields temporally stable, instruction-faithful edits with interactive latency, outperforming baselines on egocentric tasks while remaining competitive on general editing tasks. The dataset and benchmark are slated for public release to accelerate research in interactive generative systems for augmented reality.

Abstract

We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges, including rapid egomotion and frequent hand-object interactions, that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a manually curated dataset designed specifically for egocentric editing scenarios, featuring rich hand-object interactions while explicitly preserving hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks, where existing methods struggle, while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community. See our website at https://snap-research.github.io/EgoEdit

Paper Structure

This paper contains 35 sections, 1 equation, 13 figures, 7 tables.

Figures (13)

  • Figure 1: We propose a framework for real-time egocentric video editing. Our system is composed of: EgoEditData, a manually curated dataset of 100k video editing pairs focusing on the egocentric case and featuring object substitution and removal under challenging hand occlusions, interactions, and large egomotion; EgoEdit, the first autoregressive model for egocentric video editing that runs in real time on a single H100 GPU with 855 ms first-frame latency, enabling live augmented reality (AR) interactions; EgoEditBench, a comprehensive benchmark for evaluating egocentric video editing systems.
  • Figure 2: In-the-wild video edits produced in real time by EgoEdit's streaming variant EgoEdit-RT on a single H100 GPU. The model demonstrates strong generalization to out-of-distribution scenarios, producing compelling real-time results suitable for immersive AR experiences. Additional results are presented in the appendix and on the website.
  • Figure 3: Comparison of EgoEdit and EgoEdit-RT against baselines according to VLM score on EgoEditBench and EditVerseBench (Ju et al., 2025). EgoEdit and its real-time variant EgoEdit-RT achieve superior results on egocentric editing tasks and perform competitively with the strongest baselines on general editing tasks. EditVerse is excluded from EgoEditBench as its source code is unavailable. Streaming models are indicated with dashed lines.
  • Figure 4: Architecture of EgoEdit. EgoEdit extends a video generation DiT model for video editing by performing channel-wise concatenation of the source and noisy target video inputs, avoiding the computational overheads of sequence-wise concatenation.
  • Figure 5: Inference of EgoEdit. EgoEdit performs inference in a streaming fashion. A camera continuously acquires video frames, which the model edits chunk by chunk so that the edited video can be served to the user in a watch-as-you-generate fashion. Each blue arrow represents a model forward pass on a single video chunk, shown for a 3-step model.
  • ...and 8 more figures
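
The captions of Figures 4 and 5 describe two design points that a small example may make more concrete: conditioning by channel-wise rather than sequence-wise concatenation, and chunk-by-chunk streaming inference. The Python sketch below is illustrative only and is not the authors' implementation. The function names, the patch size, the latent dimensions, the chunk size, and the `denoise_step` callback are all hypothetical; only the 3-step setting is taken from the Figure 5 caption.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Frame = str  # placeholder type; a real system would use image or latent tensors


def token_counts(latent_frames: int, latent_h: int, latent_w: int,
                 patch: int = 2) -> Dict[str, int]:
    """Transformer sequence length under the two conditioning layouts.

    The patch size and latent dimensions are assumptions for illustration.
    """
    per_video = latent_frames * (latent_h // patch) * (latent_w // patch)
    return {
        # Source and noisy target are stacked along channels, so each token
        # carries both; the sequence length is unchanged.
        "channel_wise": per_video,
        # Source tokens are appended to target tokens, doubling the sequence.
        "sequence_wise": 2 * per_video,
    }


def stream_edit(
    frames: Iterable[Frame],
    chunk_size: int,
    num_steps: int,
    denoise_step: Callable[[List[Frame], List[Frame], int], List[Frame]],
) -> Iterator[List[Frame]]:
    """Yield edited chunks as soon as each one is ready.

    Every chunk receives `num_steps` forward passes (one call to the
    hypothetical `denoise_step` per pass), mirroring the one-arrow-per-pass
    picture in the Figure 5 caption.
    """

    def edit(chunk: List[Frame]) -> List[Frame]:
        current = list(chunk)  # stand-in for the noisy target initialization
        for step in range(num_steps):
            current = denoise_step(chunk, current, step)
        return current

    chunk: List[Frame] = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield edit(chunk)
            chunk = []
    if chunk:  # flush a trailing partial chunk in this sketch
        yield edit(chunk)


# Sequence length comparison for an assumed 8 x 32 x 32 latent video.
counts = token_counts(latent_frames=8, latent_h=32, latent_w=32)
# Self-attention cost grows quadratically with sequence length.
attention_cost_ratio = (counts["sequence_wise"] / counts["channel_wise"]) ** 2

# Streaming demo with a fake denoiser that just marks each frame per pass.
calls: List[int] = []


def fake_step(source: List[Frame], current: List[Frame], step: int) -> List[Frame]:
    calls.append(step)
    return [f + "'" for f in current]


camera = (f"f{i}" for i in range(7))  # 7 frames arriving one at a time
edited_chunks = list(stream_edit(camera, chunk_size=3, num_steps=3,
                                 denoise_step=fake_step))
```

In this toy setting, sequence-wise concatenation doubles the token count, so the quadratic self-attention term costs four times as much, whereas channel-wise concatenation leaves the token count unchanged and only widens the input projection. The streaming loop makes the latency trade-off visible: the first edited chunk can be shown after `chunk_size` frames and `num_steps` forward passes, without waiting for the rest of the video.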