Point-to-Point: Sparse Motion Guidance for Controllable Video Editing

Yeji Song, Jaehyun Lee, Mijin Koo, JunHoo Lee, Nojun Kwak

TL;DR

This work introduces anchor tokens, a sparse, automated motion representation derived from a pre-trained video diffusion model, to guide editing while preserving source motion. By collecting and selecting representative token trajectories with Farthest Point Sampling and aligning them to new subjects, Point-to-Point achieves robust motion transfer across diverse subjects without manual keypoints. Extensive quantitative and human studies show improved joint edit and motion fidelity and strong generalization, outperforming signal-based and adaptation-based baselines, including open-world pose estimators. The approach offers practical, layout-agnostic video editing with broad applicability to customized subject swapping and multi-subject scenes, marking a notable advancement in motion-aware video editing.
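The selection step mentioned above can be illustrated with a minimal sketch of Farthest Point Sampling over trajectories. This is not the authors' implementation: the function name `farthest_point_sampling`, the input layout (one list of `(x, y)` points per token), and the mean Euclidean distance between trajectories are all assumptions made for illustration. The paper may measure similarity differently, for example in feature space.

```python
import math


def farthest_point_sampling(trajectories, k, start=0):
    """Greedily pick k trajectory indices that are mutually far apart.

    trajectories: list of equal-length lists of (x, y) points, one list per
    tracked token (hypothetical layout, chosen for this sketch only).
    """

    def dist(a, b):
        # Assumed metric: mean Euclidean distance over all frames.
        return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

    n = len(trajectories)
    k = min(k, n)
    selected = [start]
    # min_dist[i] is the distance from trajectory i to its nearest selected one.
    min_dist = [dist(trajectories[i], trajectories[start]) for i in range(n)]
    while len(selected) < k:
        # Pick the trajectory farthest from everything selected so far.
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], dist(trajectories[i], trajectories[nxt]))
    return selected


# Toy example: trajectories 0 and 1 are near-duplicates, the rest are far apart.
toy = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.1, 0.0), (1.1, 0.0)],
    [(10.0, 0.0), (11.0, 0.0)],
    [(0.0, 10.0), (1.0, 10.0)],
    [(10.0, 10.0), (11.0, 10.0)],
]
picked = farthest_point_sampling(toy, k=3)
```

In this toy input the near-duplicate trajectory (index 1) is never picked, which is the property that makes a greedy farthest-first selection a natural way to obtain a sparse, non-redundant set.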

Abstract

Accurately preserving motion while editing a subject remains a core challenge in video editing. Existing methods often face a trade-off between edit fidelity and motion fidelity, as they rely on motion representations that are either overfitted to the layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representations. However, identifying meaningful points remains challenging without human input, especially across diverse video scenarios. To address this, we propose anchor tokens, a novel motion representation that captures the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior performance in terms of edit and motion fidelity.

Paper Structure

This paper contains 23 sections, 6 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Comparison of editing results given a source video (left). The bottom-left insets show the optical flow estimated by RAFT [teed2020raft] from the generated videos. (a) The signal-based method FLATTEN [cong2023flatten] produces overfitted layouts. (b) The adaptation-based method MotionDirector [zhao2024motiondirector] produces inaccurate motion. (c) The point-based method VideoSwap [gu2024videoswap] generates inaccurate motion when the points fail to capture meaningful motion. Ours successfully edits the subject while preserving the motion.
  • Figure 2: Given a source video, our Point-to-Point extracts compact and essential motion information using point trajectories, and transfers it to guide edits across diverse target subjects, including humans, animals, objects, and even scene-level motion.
  • Figure 3: Left: From motion tokens tracked across latent features $\tilde{z}_1, \dots, \tilde{z}_N$, we select a sparse set of anchor tokens (colored) that capture representative motion trajectories by computing similarity in feature space, denoted as $\mathcal{F}$, while filtering out redundant or out-of-subject tokens (black). Right: During editing, anchor tokens are aligned by identifying semantically corresponding locations in the edited video and injected at those positions, enabling motion transfer to the new subject.
  • Figure 4: Quantitative comparison of edit fidelity (x-axis) and motion fidelity (y-axis).
  • Figure 5: Qualitative comparison of video editing results across various videos.
  • ...and 7 more figures
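
The alignment step described in the Figure 3 caption (locating semantically corresponding positions in the edited video) can be sketched as a nearest-neighbour search in feature space. This is an illustrative assumption, not the paper's procedure: the helper names `cosine` and `align_anchors`, the use of cosine similarity, and the representation of a frame as a 2D grid of feature vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def align_anchors(anchor_feats, target_feats):
    """For each anchor feature, return the (row, col) of its best match.

    anchor_feats: list of feature vectors, one per anchor token.
    target_feats: 2D grid (list of rows) of feature vectors for one frame of
    the edited video (hypothetical layout, chosen for this sketch only).
    """
    cells = [(r, c) for r, row in enumerate(target_feats) for c in range(len(row))]
    positions = []
    for f in anchor_feats:
        # Choose the grid cell whose feature is most similar to the anchor.
        best = max(cells, key=lambda rc: cosine(f, target_feats[rc[0]][rc[1]]))
        positions.append(best)
    return positions


# Toy 2x2 feature grid with 2-dimensional features.
grid = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(1.0, 1.0), (-1.0, 0.0)],
]
anchors = [(0.0, 2.0), (3.0, 3.0)]
matched = align_anchors(anchors, grid)
```

Each anchor is matched independently here. A real system would likely need to handle ambiguous or many-to-one matches, which this sketch does not address.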