Point-to-Point: Sparse Motion Guidance for Controllable Video Editing
Yeji Song, Jaehyun Lee, Mijin Koo, JunHoo Lee, Nojun Kwak
TL;DR
This work introduces anchor tokens, a sparse, automatically extracted motion representation derived from a pre-trained video diffusion model, to guide editing while preserving source motion. By collecting token trajectories, selecting representative ones with Farthest Point Sampling, and aligning them to new subjects, Point-to-Point transfers motion robustly across diverse subjects without manual keypoints. Extensive quantitative and human studies show improved joint edit and motion fidelity and strong generalization, outperforming signal-based and adaptation-based baselines, including open-world pose estimators. The approach offers practical, layout-agnostic video editing that applies to customized subject swapping and multi-subject scenes.
Abstract
Accurately preserving motion while editing a subject remains a core challenge in video editing. Existing methods often face a trade-off between edit fidelity and motion fidelity, as they rely on motion representations that are either overfitted to the source layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representation. However, identifying meaningful points without human input remains challenging, especially across diverse video scenarios. To address this, we propose anchor tokens, a novel motion representation that captures the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior edit and motion fidelity.
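The TL;DR mentions selecting representative token trajectories with Farthest Point Sampling. As a rough illustration of that selection step only, the sketch below runs greedy Farthest Point Sampling over a set of 2D point trajectories. The trajectory format (one `(x, y)` position per frame), the mean per-frame Euclidean distance, and the names `trajectory_distance` and `farthest_point_sampling` are assumptions made for this example. The summary above does not specify the feature space or distance the method uses.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]  # assumed format: one (x, y) position per frame


def trajectory_distance(a: Trajectory, b: Trajectory) -> float:
    """Mean per-frame Euclidean distance (an assumed metric, for illustration)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("trajectories must be non-empty and cover the same frames")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def farthest_point_sampling(
    trajectories: Sequence[Trajectory], k: int, start: int = 0
) -> List[int]:
    """Greedily pick k trajectory indices that are mutually far apart."""
    n = len(trajectories)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of trajectories")
    if not 0 <= start < n:
        raise ValueError("start must be a valid trajectory index")

    selected = [start]
    # min_dist[i]: distance from trajectory i to its nearest selected trajectory
    min_dist = [trajectory_distance(t, trajectories[start]) for t in trajectories]

    while len(selected) < k:
        # The next pick is the trajectory farthest from everything chosen so far.
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            d = trajectory_distance(trajectories[i], trajectories[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy example: four two-frame trajectories, two of which are near-duplicates.
toy = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.1, 0.0), (1.1, 0.0)],   # nearly identical to the first
    [(10.0, 0.0), (11.0, 0.0)],
    [(0.0, 5.0), (1.0, 5.0)],
]
anchors = farthest_point_sampling(toy, k=3)
```

In this toy input the near-duplicate trajectory is skipped, which is the property that makes Farthest Point Sampling a natural fit for choosing a small, spread-out set of representative trajectories.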
