MoRight: Motion Control Done Right

Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta, Shenlong Wang, Sanja Fidler, Jun Gao

Abstract

Generating motion-controlled videos, where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints, demands two capabilities: (1) disentangled motion control, allowing users to separately control object motion and adjust the camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal and treat motion as kinematic displacement without modeling causal relationships between object motions. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data. At inference, users can either supply active motion and let MoRight predict the consequences (forward reasoning), or specify desired passive outcomes and have MoRight recover plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.
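
As a rough illustration of the two inference modes described in the abstract, the following minimal Python sketch shows what a conditioning interface for this kind of generation might look like. The names (MotionCondition, CameraCondition, generate_video), the tensor shapes, and the placeholder sampler are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical conditioning interface; names and shapes are assumptions for
# illustration only, not the MoRight codebase.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MotionCondition:
    """Point tracks specified in the canonical (static) view, shape (T, N, 2)."""
    active_tracks: Optional[np.ndarray] = None   # user-driven motion (e.g. a hand)
    passive_tracks: Optional[np.ndarray] = None  # desired consequences (e.g. a teapot)

@dataclass
class CameraCondition:
    """Per-frame camera poses for the target viewpoint, shape (T, 4, 4)."""
    poses: np.ndarray

def generate_video(image: np.ndarray,
                   motion: MotionCondition,
                   camera: CameraCondition) -> np.ndarray:
    """Stand-in for the diffusion sampler.

    Forward reasoning: active_tracks is given, passive motion is predicted.
    Inverse reasoning: passive_tracks is given, the driving action is recovered.
    In both cases camera.poses controls the viewpoint independently of object motion.
    """
    num_frames = camera.poses.shape[0]
    return np.zeros((num_frames, *image.shape))  # placeholder frames

# Forward reasoning with a user-chosen camera trajectory (dummy tensors).
frames = generate_video(
    image=np.zeros((256, 256, 3)),
    motion=MotionCondition(active_tracks=np.zeros((16, 8, 2))),
    camera=CameraCondition(poses=np.tile(np.eye(4), (16, 1, 1))),
)
```

In forward reasoning only active_tracks would be populated, in inverse reasoning only passive_tracks, and the camera poses vary freely in either mode.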

Paper Structure

This paper contains 29 sections, 6 equations, 15 figures, and 3 tables.

Figures (15)

  • Figure 1: Given a single input image, our method enables controllable interactive motion generation with motion causality reasoning. Left: Users can provide active motion (e.g., the action of a hand) to drive scene dynamics (forward reasoning) or specify desired passive outcomes (e.g., the trajectory of a teapot) and recover plausible driving actions (inverse reasoning). Right: The model further enables disentangled control of object motion and camera viewpoint, allowing users to explore the scene with custom viewpoints and motions.
  • Figure 2: Model architecture. Our model adopts a dual-stream architecture with shared weights to disentangle object motion from camera motion. The canonical stream encodes motion trajectories using a track encoder and learns motion in a fixed canonical view. The target stream encodes camera pose signals through a camera encoder. The resulting motion and camera conditions are injected into every attention block of the network. Cross-view self-attention connects the two streams, transferring motion learned in the canonical view to the target view and enabling disentangled camera–object motion generation (a code sketch of this block follows the figure list).
  • Figure 3: Active vs. passive motion. The active object (hand) initiates the action, while the passive object (cloth) responds.
  • Figure 4: Data curation pipeline. Foundation models [harley2025alltracker, huang2025vipe, ravi2024sam2] extract depth, camera poses, and tracks from raw videos. A VLM (Qwen3-VL) segments tracks into active/passive regions. We further optionally use a video-to-video model [fu2026plenoptic] to generate paired videos with the same object motion but different camera motions.
  • Figure 5: Disentangled camera–object control. MoRight enables independent control of object motion and camera viewpoint. Rows 1-3 fix the camera and vary object motion (rows 1-2: forward reasoning; row 3: inverse reasoning), while rows 4-6 fix object motion and vary camera motion.
  • ...and 10 more figures
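
To make the dual-stream design in Figure 2 concrete, here is a hedged PyTorch sketch of one shared attention block with cross-view self-attention. The tensor shapes, the additive injection of motion and camera conditions, and all module names are assumptions made for illustration; they are not the released architecture.

```python
# Hedged sketch of a shared dual-stream block with cross-view self-attention.
# Condition injection and shapes are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class CrossViewBlock(nn.Module):
    """One attention block shared by the canonical and target streams.

    Self-attention runs over the concatenation of canonical-view tokens and
    target-view tokens, so motion learned in the fixed canonical view can be
    transferred to the freely chosen target viewpoint.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, canon: torch.Tensor, target: torch.Tensor,
                motion_cond: torch.Tensor, camera_cond: torch.Tensor):
        # Inject the track-encoder output into the canonical stream and the
        # camera-encoder output into the target stream (additive, as an assumption).
        canon = canon + motion_cond
        target = target + camera_cond

        # Cross-view self-attention over both streams jointly (shared weights).
        x = torch.cat([canon, target], dim=1)
        h, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x),
                         need_weights=False)
        x = x + h
        x = x + self.mlp(self.norm2(x))

        n = canon.shape[1]
        return x[:, :n], x[:, n:]  # split back into canonical / target streams

# Toy usage: batch of 2, 64 tokens per stream, width 128.
blk = CrossViewBlock(dim=128)
c, t = blk(torch.randn(2, 64, 128), torch.randn(2, 64, 128),
           torch.randn(2, 64, 128), torch.randn(2, 64, 128))
```

Running both streams through one joint self-attention with shared weights is what lets motion specified in the fixed canonical view influence tokens rendered from the target viewpoint, matching the disentanglement argument in the Figure 2 caption.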