NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics

Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, Stanley H. Chan

TL;DR

NewtonGen targets two weaknesses of current text-to-video models: motion that looks plausible but violates physics, and the lack of controllable dynamics. It introduces Neural Newtonian Dynamics (NND), a physics-informed neural ODE that learns latent Newtonian motions from a small set of physics-clean data, and couples it with a motion-controlled video generator through optical-flow conditioning. The approach yields physically consistent trajectories and precise parameter control across diverse motion types, and it outperforms existing methods in physical plausibility. The framework also improves generalization to out-of-distribution dynamics and gives users interpretable control over motion in text-to-video synthesis.

Abstract

A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.

Paper Structure

This paper contains 30 sections, 7 equations, 31 figures, 3 tables, 1 algorithm.

Figures (31)

  • Figure 1: NewtonGen generates physically-consistent videos from text prompts, with diverse dynamic perception (a), and precise parameter control (b).
  • Figure 2: The overall framework of NewtonGen. a) Neural Newtonian Dynamics (NND) employs physics-informed linear neural ODEs combined with an MLP to build a general dynamics learning framework suitable for diverse motions. b) We train NND on a physics-clean dataset to capture the underlying dynamics. c) Using the learned NND together with a data-driven motion-controlled model, we generate physically plausible and controllable videos.
  • Figure 3: Visual comparisons of different text-to-video generation methods across diverse physical dynamics, where our method consistently shows strong physical consistency.
  • Figure 4: NewtonGen generates videos that can accurately reflect user-specified initial physical parameters, including object position, velocity, angle, shape and size.
  • Figure 5: Sample physics-clean videos generated by our simulator.
  • ...and 26 more figures
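
The Figure 2 caption describes NND as physics-informed linear neural ODEs combined with an MLP. The sketch below illustrates that general idea only and is not the authors' implementation: it assumes a 2D point state (x, y, vx, vy), a linear second-order acceleration term with hypothetical coefficients (k, c, gx, gy), a small MLP that adds a residual acceleration, and a fixed-step RK4 integrator. All function names, the state layout, and the parameterization are assumptions made for illustration; the excerpt does not specify them.

```python
import math


def mlp_residual(state, w1, b1, w2, b2):
    """One-hidden-layer tanh MLP mapping state (4,) to a residual acceleration (2,)."""
    hidden = [math.tanh(sum(w * s for w, s in zip(row, state)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def dynamics(state, params, mlp):
    """Time derivative of state = (x, y, vx, vy).

    Acceleration = linear second-order term + MLP residual.
    params = (k, c, gx, gy) are hypothetical coefficients: a position
    coefficient, a velocity (damping) coefficient, and a constant acceleration.
    """
    x, y, vx, vy = state
    k, c, gx, gy = params
    rx, ry = mlp_residual(state, *mlp)
    ax = -k * x - c * vx + gx + rx
    ay = -k * y - c * vy + gy + ry
    return [vx, vy, ax, ay]


def rk4_step(state, dt, params, mlp):
    """Advance the state by one classical Runge-Kutta (RK4) step of size dt."""
    def shifted(s, d, h):
        return [si + h * di for si, di in zip(s, d)]

    k1 = dynamics(state, params, mlp)
    k2 = dynamics(shifted(state, k1, dt / 2), params, mlp)
    k3 = dynamics(shifted(state, k2, dt / 2), params, mlp)
    k4 = dynamics(shifted(state, k3, dt), params, mlp)
    return [s + dt / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]


def rollout(state0, n_steps, dt, params, mlp):
    """Integrate from state0 and return the list of n_steps + 1 states."""
    traj = [list(state0)]
    for _ in range(n_steps):
        traj.append(rk4_step(traj[-1], dt, params, mlp))
    return traj


def zero_mlp(hidden=8):
    """MLP weights set to zero, so the residual acceleration vanishes."""
    w1 = [[0.0] * 4 for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[0.0] * hidden for _ in range(2)]
    b2 = [0.0, 0.0]
    return (w1, b1, w2, b2)


# Usage: projectile motion with initial velocity (3, 4) and gravity -9.8 on y.
trajectory = rollout([0.0, 0.0, 3.0, 4.0], n_steps=100, dt=0.01,
                     params=(0.0, 0.0, 0.0, -9.8), mlp=zero_mlp())
print(trajectory[-1])
```

With the MLP zeroed out, the rollout reduces to the closed-form projectile solution, which makes the linear part easy to check in isolation. In a trained model the coefficients and MLP weights would be fitted to trajectories, and changing the initial state is what would give the kind of parameter control the captions describe.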