EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing

Runjia Li; Moayed Haji-Ali; Ashkan Mirzaei; Chaoyang Wang; Arpit Sahni; Ivan Skorokhodov; Aliaksandr Siarohin; Tomas Jakab; Junlin Han; Sergey Tulyakov; Philip Torr; Willi Menapace

TL;DR

The paper addresses the lack of egocentric, instruction-guided video editing for AR with three contributions: EgoEditData (a large egocentric editing dataset that preserves hands), EgoEdit (a real-time video editor trained with flow matching and distillation for low-latency streaming inference), and EgoEditBench (a standardized benchmark for egocentric editing). It shows that a data- and distillation-driven approach yields temporally stable, instruction-faithful edits at interactive latency, outperforming baselines on egocentric tasks while remaining competitive on general editing tasks. The dataset and benchmark are slated for public release to support research on interactive generative systems for augmented reality.

Abstract

We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges, including rapid egomotion and frequent hand-object interactions, that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a manually curated dataset designed specifically for egocentric editing scenarios, featuring rich hand-object interactions while explicitly preserving hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks, where existing methods struggle, while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community. See our website at https://snap-research.github.io/EgoEdit
