Motion-o: Trajectory-Grounded Video Reasoning

Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas

Abstract

Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about \emph{how} objects move between observations: no prior work has articulated motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories explicit through a discrete \texttt{<motion/>} tag, which summarizes per-object direction, speed, and scale (of velocity) change and thereby connects grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.
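To make the idea of a \texttt{<motion/>} tag concrete, the following is a minimal Python sketch of how direction, speed, and scale labels could be derived from two grounded observations of the same object. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tag attribute names (`obj`, `dir`, `speed`, `scale`), the normalized box format, and all thresholds are hypothetical. Only the label vocabularies (eight compass directions plus stationary; slow, moderate, fast; stable, approaching, receding) follow the paper's figure captions, and reading a growing box as "approaching" is likewise an assumption.

```python
import math

# Compass labels ordered counter-clockwise, starting at east (angle 0).
DIRECTIONS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]


def center_and_area(box):
    """Return (cx, cy, area) for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, (x2 - x1) * (y2 - y1)


def motion_primitives(box_a, t_a, box_b, t_b,
                      still_eps=0.01, slow_max=0.05,
                      moderate_max=0.15, scale_eps=0.10):
    """Classify motion between two observations of one object.

    Boxes are assumed to be in normalized [0, 1] image coordinates and
    times in seconds. All thresholds are hypothetical defaults.
    """
    if t_b <= t_a:
        raise ValueError("t_b must be later than t_a")
    cx_a, cy_a, area_a = center_and_area(box_a)
    cx_b, cy_b, area_b = center_and_area(box_b)
    dx = cx_b - cx_a
    dy = cy_a - cy_b  # image y grows downward, so flip it: up means north
    dist = math.hypot(dx, dy)
    if dist < still_eps:
        direction = speed = "stationary"
    else:
        angle = math.degrees(math.atan2(dy, dx)) % 360
        direction = DIRECTIONS[int((angle + 22.5) // 45) % 8]
        rate = dist / (t_b - t_a)  # normalized image widths per second
        if rate < slow_max:
            speed = "slow"
        elif rate < moderate_max:
            speed = "moderate"
        else:
            speed = "fast"
    ratio = area_b / area_a
    if ratio > 1 + scale_eps:
        scale = "approaching"  # assumption: a larger box means closer
    elif ratio < 1 - scale_eps:
        scale = "receding"
    else:
        scale = "stable"
    return direction, speed, scale


def motion_tag(obj, direction, speed, scale):
    """Format the labels as a self-closing tag (attribute names assumed)."""
    return f'<motion obj="{obj}" dir="{direction}" speed="{speed}" scale="{scale}"/>'


# Example: an object drifting right across the frame over two seconds.
labels = motion_primitives((0.10, 0.40, 0.20, 0.50), 0.0,
                           (0.30, 0.40, 0.40, 0.50), 2.0)
tag = motion_tag("duck", *labels)
```

Working from box centers and areas keeps the labels discrete and easy to check against the underlying track, which is the property the abstract describes as making trajectories verifiable.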

Paper Structure

This paper contains 33 sections, 8 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Motion-o: trajectory-grounded video reasoning. Although recent video models can produce fluent <think> traces, their reasoning is typically ungrounded: it neither ties claims back to where and when evidence occurs in the video nor encodes the motion that connects observations. Motion-o introduces a Spatial-Temporal-Trajectory (STT) evidence chain: timestamped <obj, box, t> observations paired with a structured <motion/> tag that explicitly links snapshots into a trajectory-faithful trace.
  • Figure 2: CoT vs. Motion Chain-of-Thought (MCoT). CoT in meng2025open yields sparse temporal bounding boxes, forcing implicit inter-frame interpolation. MCoT adds an object-conditioned <motion/> tag that explicitly parameterizes the dynamics between observations, making the evidence chain trajectory-consistent and reducing prior-driven extrapolation.
  • Figure 3: Motion-o end-to-end pipeline. Starting from the spatio-temporal grounding dataset of meng2025open, which provides keyframe boxes, and adding our dense interpolated bounding boxes (purple box), we (i) compute track-derived motion primitives (direction/speed/scale) and inject them into the reasoning trace as an MCoT <motion/> tag (blue box), (ii) perform supervised fine-tuning (Stage 1) to teach the model the structured STT/MCoT format, and (iii) perform RL with GSPO (Stage 2) on original vs. motion-masked videos to reward motion tags that align with trajectories and that change when temporal evidence is removed.
  • Figure 4: Qualitative examples of Motion-o reasoning. Top: The model tracks Sheldon across three timestamps with varying camera angles and emits two <motion/> tags, both correctly identifying the subject as stationary despite significant viewpoint changes. The dense multi-point grounding enables the model to distinguish true stationarity from apparent visual displacement caused by camera cuts. Bottom: The model grounds a duck at four consecutive timestamps as it swims through a group, summarizing the trajectory with a single motion tag capturing eastward direction at moderate speed.
  • Figure S5: Motion data compass showing the distribution of observations across direction, speed, and scale. Directional wedges radiate from the center using square-root scaling: N (166), NE (246), E (1,244), SE (377), S (325), SW (321), W (1,171), and NW (260); the central circle is stationary (5,582). The inner ring encodes speed: stationary (5,582), slow (1,396), moderate (1,266), fast (1,448). The outer ring encodes scale: stable (5,751), approaching (2,322), receding (1,619). Arc lengths are proportional to each category's share.
  • ...and 2 more figures
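
The Figure 3 caption mentions dense interpolated bounding boxes added on top of sparse keyframe boxes. One simple way such densification could work is linear interpolation between consecutive keyframes, sketched below in Python. This is an assumption for illustration: the captions do not specify the interpolation scheme, and the function name `densify_track`, the `(t, box)` track format, and the `step` parameter are hypothetical.

```python
def densify_track(keyframes, step):
    """Linearly interpolate boxes between consecutive keyframes.

    keyframes: list of (t, (x1, y1, x2, y2)) sorted by time t (seconds).
    step: approximate spacing, in seconds, of the densified track.
    Returns a new list of (t, box) that includes every original keyframe.
    """
    if not keyframes:
        return []
    dense = []
    for (t0, b0), (t1, b1) in zip(keyframes, keyframes[1:]):
        # Number of sub-intervals between this pair of keyframes.
        n = max(1, round((t1 - t0) / step))
        for i in range(n):
            a = i / n  # interpolation weight, 0 at t0 and approaching 1 at t1
            box = tuple((1 - a) * p + a * q for p, q in zip(b0, b1))
            dense.append((t0 + a * (t1 - t0), box))
    dense.append(keyframes[-1])  # close the track with the last keyframe
    return dense


# Example: two keyframes two seconds apart, densified to 0.5 s spacing.
track = densify_track(
    [(0.0, (0.0, 0.0, 0.2, 0.2)), (2.0, (0.4, 0.0, 0.6, 0.2))],
    step=0.5,
)
```

Linear interpolation assumes roughly constant velocity and size change between keyframes. That is reasonable over short gaps but would smooth over abrupt motion or camera cuts, so a real pipeline may need a different scheme for those cases.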