MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

Xiangyu Bai, He Liang, Bishoy Galoaa, Utsav Nandi, Shayda Moezzi, Yuhang He, Sarah Ostadabbas

TL;DR

The paper addresses the gap between visually realistic text-to-video (T2V) outputs and physically valid motion by introducing MoReGen, a multi-agent, physics-grounded T2V pipeline that translates natural language prompts into executable Newtonian simulations and renders physically coherent videos. It combines a supervised fine-tuned text-parser, a code-writer, a video-render agent, and an evaluator in an iterative feedback loop, producing reproducible physics-based videos. To evaluate physical validity, the authors introduce MoReSet, a benchmark of 1,275 videos across nine classes of Newtonian phenomena with ground-truth object trajectories, and MoRe metrics for trajectory-centric, motion-consistency evaluation, complemented by existing physics-based evaluators. Experiments show that state-of-the-art T2V models struggle to maintain physical validity, while MoReGen achieves superior trajectory fidelity and coherence, highlighting the need for physics-aware evaluation in video synthesis. The work lays a principled foundation for physics-grounded T2V and points toward future extensions to photorealistic 3D rendering and broader dynamical systems.

Abstract

While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.
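To make the idea of object-trajectory correspondence concrete, here is a minimal sketch that scores a candidate trajectory against a ground-truth one. It is an illustration under stated assumptions, not the paper's MoRe metrics: the function names `mean_trajectory_error` and `free_fall`, the use of mean Euclidean distance, and the free-fall test case are all hypothetical choices made for this example.

```python
import math


def mean_trajectory_error(pred, truth):
    """Mean Euclidean distance between two equal-length 2D trajectories.

    Hypothetical stand-in for a trajectory-correspondence score;
    lower means the predicted motion tracks the ground truth more closely.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, t) for p, t in zip(pred, truth))
    return total / len(pred)


def free_fall(y0, g, dt, steps):
    """Analytic free-fall positions: x stays at 0, y = y0 - 0.5 * g * t^2."""
    return [(0.0, y0 - 0.5 * g * (i * dt) ** 2) for i in range(steps)]


# Ground truth: an object dropped from 10 m, sampled every 0.1 s.
truth = free_fall(y0=10.0, g=9.81, dt=0.1, steps=10)

# A physically implausible candidate: the object sinks at constant speed.
drift = [(0.0, 10.0 - 1.0 * i * 0.1) for i in range(10)]

perfect_score = mean_trajectory_error(truth, truth)
drift_score = mean_trajectory_error(drift, truth)
```

A trajectory that matches the ground truth scores zero, while the constant-speed candidate scores strictly higher, which is the kind of signal a trajectory-centric evaluation relies on and a purely visual quality score would miss.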

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of our multi-agent motion-reasoning engine (MoReGen) for physics-grounded text-to-video synthesis. MoReGen focuses on achieving high-precision Newtonian motion through coordinated multi-agent reasoning. Given a natural language prompt, the text-parser agent $\mathcal{A}_{\text{text}}$ extracts physical parameters and motion descriptors, which the code-writer agent $\mathcal{A}_{\text{coder}}$ then translates into executable simulation code. The resulting code is executed within a sandboxed video-render agent $\mathcal{A}_{\text{render}}$ to produce a physically plausible video. This video serves as the basis for evaluator feedback, which guides improvements in code robustness and physical fidelity in subsequent iterations. By leveraging open-source LLMs, few-shot tuning of $\mathcal{A}_{\text{text}}$, and a multi-modal evaluator, MoReGen enables accurate and reproducible Newtonian motion synthesis from natural language instructions.
  • Figure 2: Sample frames and corresponding text prompts from our MoReSet dataset. Each image (extracted from videos) illustrates a distinct Newtonian physics phenomenon. The provided annotation for the pendulum corresponds to the rightmost video in the second row, with highlighted text emphasizing the numerical relationships depicted in the scene.
  • Figure 3: Qualitative comparison of our model with recent open-source and commercial models, each prompted to generate a video of Newton's cradle. We used the same prompt for all models: "Generate a video that showcase the following scene: Five shiny metal balls of a newton's cradle is visible, along with parts of a single vertical string for each metal ball respectively. These strings keeps their respective metal ball suspended. The top part of the newton's cradle is not visible. The camera faces all the five metal balls. The first and leftmost ball is at an angle of 30 degrees from the cradle and released. Due to gravity, the ball comes and strikes the second ball from the left. This causes momentum to be transferred to the fifth and the right most ball which is launched at a slightly lesser angle, having lost some momentum. This process keeps repeating itself till the rightmost ball has lost a lot of momentum when the video ends." For Grok Imagine, we always select the first generated video; for WISA, we use Qwen3-4B to generate the asset .json file from our prompt.
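
The parse, write, render, and evaluate loop described in the Figure 1 caption can be sketched as a small control loop. This is a hedged illustration only: `run_pipeline`, `Feedback`, and the four `toy_*` stand-ins are hypothetical names invented here, and the stubs replace the LLM agents, the sandboxed renderer, and the multi-modal evaluator with trivial string logic so that the loop itself is runnable.

```python
import re
from dataclasses import dataclass


@dataclass
class Feedback:
    """Evaluator verdict passed back to the code-writer (hypothetical shape)."""
    accepted: bool
    notes: str = ""


def run_pipeline(prompt, parse, write_code, render, evaluate, max_iters=3):
    """Iterate write -> render -> evaluate until accepted or out of budget."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    spec = parse(prompt)           # text-parser: prompt -> physical parameters
    feedback = None
    for iteration in range(1, max_iters + 1):
        code = write_code(spec, feedback)   # code-writer: spec (+ feedback) -> code
        video = render(code)                # render agent: code -> video
        feedback = evaluate(prompt, video)  # evaluator: video -> feedback
        if feedback.accepted:
            break
    return video, iteration, feedback


# --- Toy stand-ins for the agents (assumptions, not the paper's components) ---

def toy_parse(prompt):
    match = re.search(r"(\d+) degrees", prompt)
    return {"angle_deg": int(match.group(1)) if match else 0}


def toy_write_code(spec, feedback):
    # The first attempt deliberately drops the angle to exercise the loop.
    angle = spec["angle_deg"] if feedback is not None else 0
    return f"release_angle_deg = {angle}"


def toy_render(code):
    # A real renderer would execute the code in a sandbox; here we just wrap it.
    return {"source": code}


def toy_evaluate(prompt, video):
    ok = "release_angle_deg = 30" in video["source"]
    return Feedback(ok, "" if ok else "release angle does not match the prompt")


video, iterations, feedback = run_pipeline(
    "The leftmost ball is at an angle of 30 degrees and released.",
    toy_parse, toy_write_code, toy_render, toy_evaluate,
)
```

In this toy run the first draft is rejected and the second is accepted, which mirrors how evaluator feedback is meant to steer the code-writer in later iterations. Passing the agents in as callables keeps the loop independent of any particular model or renderer.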