Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

Sainan Liu; Tz-Ying Wu; Hector A Valdez; Subarna Tripathi

Abstract

We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods that require trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control. It leverages the motion priors of first-last-frame-to-video (FLF2V) models to relocate objects while preserving scene stability, without fine-tuning. Reliable target frames are constructed through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics. This offers interpretable user feedback and motivates ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Because existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, together with FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on both FLF2V-obj and VBench.
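The abstract does not specify how attention consensus is scored. As a rough illustration only, the sketch below assumes that a few early denoising steps have already been run for each candidate noise seed. It also assumes that several self-attention maps (for example, from different layers or heads) have been flattened into vectors. Each seed is scored by the mean pairwise cosine similarity of its maps, and the highest-scoring seed is kept. The function names (`consensus_score`, `select_seed`) and the cosine-based score are hypothetical and are not the paper's ACE-Seed definition. The video model and attention extraction are not shown.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened attention maps."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def consensus_score(maps: List[Sequence[float]]) -> float:
    """Hypothetical consensus score: mean pairwise cosine similarity
    across early-step attention maps collected for one seed."""
    pairs = list(combinations(maps, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def select_seed(attn_by_seed: Dict[int, List[Sequence[float]]]) -> int:
    """Return the seed whose early-step attention maps agree the most."""
    return max(attn_by_seed, key=lambda s: consensus_score(attn_by_seed[s]))


# Toy example: seed 0's maps disagree, seed 7's maps largely agree.
attn = {
    0: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    7: [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
}
print(select_seed(attn))  # 7
```

This kind of rule needs only early-step maps, so it avoids full look-ahead rollouts. The chosen seed would then be used for the complete sampling run.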

Paper Structure (44 sections, 4 equations, 21 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Motion Control for Video Diffusion
Fine-Tuned Motion Guidance
Frame-Based Guidance
Inference-Time Search and Seed Selection
Object-Level Evaluation: Metrics and Benchmarks
Methods
Object Motion Editing as FLF2V
Task Definition.
FLF2V Reformulation.
Semantic-guided Object Placement.
Synthesizing the Last-Frame Condition.
Robust Generation through Early-Step Attention Guidance
Early-step Trajectory Preview.
...and 29 more sections

Figures (21)

Figure 1: Left: Search2Motion is a training-free pipeline for object-level motion editing. Given a single image and a user-specified target location, Search2Motion constructs a target frame and leverages pretrained FLF2V motion priors to synthesize realistic object motion, without retraining or auxiliary control signals. Right: sample pairs from the Search2Motion Benchmark, which comprises two stable-camera datasets for object-only motion evaluation.
Figure 2: The Search2Motion pipeline consists of three components. The user interacts with the application at the target-frame construction stage (background inpainting and object placement). The original input image and the user-edited last frame are then passed to a first-frame last-frame (FLF2V) video generator, which synthesizes the final video from the input image and the user's edits. ACE-Seed, a novel attention-consensus search criterion in the noise space, automatically improves the quality of the generated video.
Figure 3: Object trajectory between the first frame (yellow point) and the last frame (red point). The upper pair of frames is extracted from a raw video in DAVIS [Perazzi2016], and the lower pair is from our synthesized dataset, S2M-DAVIS.
Figure 4: Qualitative examples for object replacement using state-of-the-art image editing tools, Qwen-Image-Edit [wu2025qwenimagetechnicalreport] (left) and FLUX-Kontext [machalek-2020-kontext] (right).
Figure 5: FLF2V-obj metrics provide object-centric insight by isolating the object from the scene and evaluating its consistency across the generated sequence. Compared with DragAnything, Search2Motion produces higher-fidelity object movement and better preserves object consistency across frames.
...and 16 more figures
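The FLF2V-obj metrics are only named here. The following is a minimal sketch of one object-consistency score in the spirit of the Figure 5 caption, not the paper's definition. It assumes that per-pixel feature vectors are available for each frame (from some image encoder, not specified) and that a per-frame object mask is also available. The object is isolated by mask-averaging its features, and consistency is the mean cosine similarity to the first frame's object feature. The helper names (`masked_mean`, `object_consistency`) are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def masked_mean(features: List[Sequence[float]], mask: Sequence[bool]) -> List[float]:
    """Average per-pixel features inside the object mask, ignoring background."""
    inside = [f for f, m in zip(features, mask) if m]
    if not inside:
        raise ValueError("empty object mask")
    dim = len(inside[0])
    return [sum(f[d] for f in inside) / len(inside) for d in range(dim)]


def object_consistency(obj_feats: List[Sequence[float]]) -> float:
    """Mean cosine similarity between the first frame's object feature
    and the object feature of every later frame."""
    if len(obj_feats) < 2:
        raise ValueError("need at least two frames")
    ref = obj_feats[0]
    sims = [cosine(ref, f) for f in obj_feats[1:]]
    return sum(sims) / len(sims)


# Toy example: 3 "pixels" per frame, 2-D features; True marks object pixels.
frames = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 5.0]],  # object on the left
    [[0.0, 5.0], [1.0, 0.0], [1.0, 0.0]],  # same object, moved right
    [[0.0, 1.0], [0.0, 1.0], [9.0, 9.0]],  # object appearance changed
]
masks = [
    [True, True, False],
    [False, True, True],
    [True, True, False],
]
feats = [masked_mean(f, m) for f, m in zip(frames, masks)]
print(object_consistency(feats[:2]))  # 1.0 (relocated, appearance unchanged)
print(object_consistency(feats))      # 0.5 (third frame drifts)
```

Because the background pixels are excluded by the mask, relocating the object does not lower the score, while changes to the object's own appearance do.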
