Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi

Abstract

We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.

Paper Structure (44 sections, 4 equations, 21 figures, 7 tables, 1 algorithm)

This paper contains 44 sections, 4 equations, 21 figures, 7 tables, 1 algorithm.

Figures (21)

  • Figure 1: Left: Search2Motion is a training-free pipeline for object-level motion editing. Given a single image and a user-specified target location, Search2Motion constructs a target frame and leverages pretrained FLF2V motion priors to synthesize realistic object motion, without retraining or auxiliary control signals. Right: Sample pairs from the Search2Motion Benchmark, two stable-camera datasets for object-only motion evaluation.
  • Figure 2: The Search2Motion pipeline consists of three components. The user interacts with the application at the target-frame construction stage (Background Inpainting and Object Placement). The original input image and the user-edited last frame are then passed to a first-frame last-frame (FLF2V) video generator, which produces the final video from the input image and the user's preference. ACE-Seed, a novel attention-consensus search criterion in the noise space, automatically improves the quality of the generated video.
  • Figure 3: Object trajectory between the first frame (yellow point) and the last frame (red point). The upper pair of frames is extracted from raw video in DAVIS [Perazzi2016], and the lower pair is from our synthesized dataset, S2M-DAVIS.
  • Figure 4: Qualitative examples for object replacement using state-of-the-art image editing tools, Qwen-Image-Edit [wu2025qwenimagetechnicalreport] (left) and FLUX-Kontext [machalek-2020-kontext] (right).
  • Figure 5: FLF2V-obj metrics provide object-centric insight by isolating the object from the scene and evaluating object consistency across the generated sequence. Compared to DragAnything, Search2Motion produces higher-fidelity object movement and better maintains object consistency across the generated sequence.
  • ...and 16 more figures