TRACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance

Quynh Phung, Long Mai, Cusuh Ham, Feng Liu, Jia-Bin Huang, Aniruddha Mahapatra

Abstract

We study object motion path editing in videos, where the goal is to alter a target object's trajectory while preserving the original scene content. Prior video editing methods primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide at inference time, especially in videos with camera motion. In contrast, we offer a practical, easy-to-use approach to controllable, object-centric motion editing. We present Trace, a framework that lets users design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps the first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.
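
The two-stage pipeline described in the abstract can be read as a simple data flow: the first stage turns the user's first-frame path into per-frame boxes, and the second stage re-synthesizes the video under that box trajectory. The sketch below is a minimal illustration of this flow only; the function name, module interfaces, and argument conventions are hypothetical placeholders, not the released API.

```python
def edit_object_motion(video, first_frame_path, object_mask,
                       cross_view_model, resynthesis_model):
    """Illustrative two-stage flow (hypothetical interfaces).

    video:            (T, H, W, 3) array of input frames
    first_frame_path: user-designed trajectory in first-frame coordinates,
                      e.g. a list of (x, y, w, h) boxes sketched on frame 0
    object_mask:      (H, W) binary mask of the target object in frame 0
    """
    # Stage 1 (cross-view motion transformation): map the first-frame path
    # design to frame-aligned, camera-aware boxes, one per video frame.
    video_view_boxes = cross_view_model.transform(video, first_frame_path)

    # Stage 2 (motion-conditioned re-synthesis): regenerate the object along
    # the new trajectory, inpaint its original location, and preserve the
    # rest of the input video.
    return resynthesis_model.generate(video=video,
                                      object_mask=object_mask,
                                      box_trajectory=video_view_boxes)
```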

Paper Structure

This paper contains 17 sections, 3 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Trace Overview. Our system consists of two modules. First, the Cross-View Motion Transformation Module converts a user's 2D input path drawn on the first frame into a scene- and camera-aware bounding-box trajectory in the video view. Second, the Video Re-Synthesis Module uses this transformed box sequence to guide the generation of a new video in which the object follows the desired path, inpainting the object's original path while preserving the remaining content of the input video. Both models in the figure are diffusion models.
  • Figure 2: Video Re-Synthesis Module. Our model employs a Diffusion Transformer (DiT) backbone conditioned on the first frame, masked video, and binary masks (object and inpainting).
  • Figure 3: Cross-View Motion Transformation Comparison. We compare our cross-view motion transformation against baseline approaches: (1) simple interpolation of bounding boxes and (2) 3D warping using MegaSAM to estimate depth and camera pose (both are sketched in simplified form after this figure list). Left: the first-frame path and the generated video-view bounding boxes on our Cross-View Motion Transformation Module evaluation set; only our method, Trace, accurately translates the first-frame-view bounding boxes to the video view. Right: the user's intended path moves the fish toward the static red coral (used as an anchor). Interpolation produces an off-track path to the right, and existing 3D warping methods yield incorrect paths due to noisy depth and pose estimation. Our cross-view transformation generates a smooth, stable, and accurate path that delivers the fish box into the coral as intended.
  • Figure 4: Comparison of Video Re-Synthesis Baselines. Given an input video with a region (the original object) masked out and conditioned on the appearance of a reference object, the model must regenerate the object within the masked region. The goal is to produce a high-fidelity video that accurately restores the object while maintaining its original identity and temporal consistency.
  • Figure 5: Full pipeline comparison. Our Cross-View Motion Transformation module converts a user's path (defined by the first and last boxes in the first frame) into a sequence of video-view boxes. This sequence is then used to guide our re-synthesis model and three baselines (which first require a separate inpainting step). For image-to-video baselines, we provide only the sequence of video-view boxes, plus background point tracks for Motion Canvas.
  • ...and 5 more figures
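
For concreteness, the two baseline families referenced in the Figure 3 caption can be sketched in simplified form. The snippet below is an illustration under assumed conventions (axis-aligned (x, y, w, h) boxes, a single warped box center, and hypothetical helper names), not the exact baseline implementations evaluated in the paper; MegaSAM enters only as the source of the depth and camera-pose estimates consumed by the warping step.

```python
import numpy as np

def interpolate_boxes(box_first, box_last, num_frames):
    """Baseline (1): linearly interpolate (x, y, w, h) boxes between the first
    and last user-specified boxes, ignoring scene layout and camera motion."""
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - t) * np.asarray(box_first, float) + t * np.asarray(box_last, float)

def warp_point_to_frame(xy, depth0, K, pose_0_to_t):
    """Baseline (2), simplified: lift a frame-0 pixel (e.g. a box center) into 3D
    using an estimated depth map and camera intrinsics, then reproject it into
    frame t with the estimated relative camera pose.

    xy:          (x, y) pixel coordinates in frame 0
    depth0:      (H, W) estimated depth map for frame 0
    K:           (3, 3) camera intrinsics
    pose_0_to_t: (4, 4) rigid transform from the frame-0 camera to the frame-t camera
    """
    x, y = xy
    z = depth0[int(round(y)), int(round(x))]
    ray = np.linalg.inv(K) @ np.array([x, y, 1.0])    # back-project to a unit-depth ray
    p_cam0 = z * ray                                  # 3D point in the frame-0 camera
    p_camt = pose_0_to_t[:3, :3] @ p_cam0 + pose_0_to_t[:3, 3]
    uvw = K @ p_camt                                  # project into frame t
    return uvw[:2] / uvw[2]
```

Because the warped trajectory depends directly on the estimated depth and poses, noise in those estimates translates into the erratic box paths that Figure 3 attributes to the 3D warping baselines.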