What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards

Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, Dimitris Samaras

TL;DR

This work tackles the mismatch between visual realism and physical realism in video generation by introducing NewtonRewards, a post-training framework that uses verifiable rewards derived from measurable proxies (optical flow for velocity and visual features for mass) to enforce Newtonian dynamics. It defines a kinematic constraint of constant image-plane acceleration and a mass-conservation constraint, and combines them into a post-training objective that guides diffusion-based video generators. On the NewtonBench-60K benchmark, across five Newtonian Motion Primitives, NewtonRewards achieves consistent improvements in physical plausibility, motion smoothness, and temporal coherence, with strong in-distribution (ID) and out-of-distribution (OOD) generalization. The results suggest that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation and point to a general framework for enforcing other physical laws via proxy-based, differentiable constraints.

Abstract

Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws: objects float, accelerations drift, and collisions behave inconsistently. These failures reveal a persistent gap between visual realism and physical realism. We propose $\texttt{NewtonRewards}$, the first physics-grounded post-training framework for video generation based on $\textit{verifiable rewards}$. Instead of relying on human or VLM feedback, $\texttt{NewtonRewards}$ extracts $\textit{measurable proxies}$ from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate $\texttt{NewtonRewards}$ on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, $\texttt{NewtonBench-60K}$. Across all primitives, on both visual and physics metrics, $\texttt{NewtonRewards}$ consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.
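
To make the kinematic constraint concrete, the following is a minimal NumPy sketch of a constant-acceleration penalty on a sequence of optical-flow fields. It is an illustration, not the authors' implementation: the function name `kinematic_residual`, the `(T, H, W, 2)` array layout, and the use of a plain mean squared residual are all assumptions made here. In practice the flow fields would come from a frozen optical-flow model rather than being built by hand.

```python
import numpy as np


def kinematic_residual(flow):
    """Mean squared second-order temporal difference of a flow sequence.

    flow: array of shape (T, H, W, 2), where flow[t] is the estimated
    displacement field between frame t and frame t + 1 (assumed layout).
    A value of zero means the flow changes linearly in time, i.e. the
    image-plane acceleration is constant.
    """
    flow = np.asarray(flow, dtype=np.float64)
    if flow.ndim != 4 or flow.shape[0] < 3:
        raise ValueError("expected shape (T, H, W, 2) with T >= 3")
    # Discrete second difference in time: v[t+1] - 2 v[t] + v[t-1].
    second_diff = flow[2:] - 2.0 * flow[1:-1] + flow[:-2]
    # Squared magnitude per pixel, averaged over time and space.
    return float(np.mean(np.sum(second_diff ** 2, axis=-1)))


# Toy example: a spatially uniform flow whose vertical component grows
# linearly in time (constant acceleration), and a jittered copy of it.
t = np.arange(6, dtype=np.float64)
smooth = np.zeros((6, 4, 4, 2))
smooth[..., 1] = (1.0 + 0.5 * t)[:, None, None]
rng = np.random.default_rng(0)
jittered = smooth + rng.normal(scale=0.3, size=smooth.shape)

print(kinematic_residual(smooth), kinematic_residual(jittered))
```

The linearly growing flow yields a zero residual, while the jittered one is penalized. A video with no motion at all also scores zero, which is the kind of trivial solution the mass conservation reward is meant to rule out.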

Paper Structure

This paper contains 25 sections, 1 theorem, 13 equations, 12 figures, 5 tables.

Key Result

Proposition 1

For an object governed by time-invariant external forces, the discrete second-order derivative of its optical-flow field predicted by $\boldsymbol{\Psi}$ vanishes. This is the optical-flow realization of Newton’s Second Law in the video domain, enforcing constant acceleration across all five Newtonian Motion Primitives.
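
A short worked derivation shows why constant acceleration makes the second difference vanish. The notation is assumed for illustration and is not taken from the paper: $\mathbf{v}_t$ denotes the image-plane velocity (the flow produced by $\boldsymbol{\Psi}$) at frame $t$, $\mathbf{a}$ a constant image-plane acceleration, and $\Delta t$ the frame interval.

```latex
\begin{align}
\mathbf{v}_t &= \mathbf{v}_0 + \mathbf{a}\, t\, \Delta t, \\
\mathbf{v}_{t+1} - 2\mathbf{v}_t + \mathbf{v}_{t-1}
  &= \bigl(\mathbf{v}_0 + \mathbf{a}(t+1)\Delta t\bigr)
   - 2\bigl(\mathbf{v}_0 + \mathbf{a}\, t\, \Delta t\bigr)
   + \bigl(\mathbf{v}_0 + \mathbf{a}(t-1)\Delta t\bigr) \\
  &= \mathbf{a}\,\Delta t\,\bigl[(t+1) - 2t + (t-1)\bigr] = \mathbf{0}.
\end{align}
```

The $\mathbf{v}_0$ terms cancel because the coefficients $1, -2, 1$ sum to zero, and the acceleration terms cancel because velocity is linear in $t$. Any nonzero residual therefore signals a deviation from constant acceleration.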

Figures (12)

  • Figure 1: NewtonRewards enforces physical laws in video generation. Shown is a parabolic throw scenario from our NewtonBench-60K dataset. Baseline supervised fine-tuning (SFT) produces implausible motion violating Newtonian dynamics. Our NewtonRewards post-training restores parabolic trajectories that follow the constant-acceleration behavior predicted by physics.
  • Figure 2: Illustration of the five NMPs in the proposed NewtonBench-60K dataset. Left: corresponding free-body diagrams showing dominant forces and accelerations. Right: rendered trajectories from our Kubric-based simulator (Greff et al., 2022), demonstrating constant-acceleration dynamics in diverse environments.
  • Figure 3: Physics-Grounded Video Post-Training Pipeline. Our method improves a pre-trained video generator by using physics-based rewards. Utility models (optical flow $\Psi$ and V-JEPA 2) process the generated video to compute measurable proxies, from which kinematic and mass conservation rewards are derived to enforce explicit physics constraints.
  • Figure 4: Relative performance change across Newtonian Motion Primitives. Percentage improvements over the SFT baseline across all five NMPs. Depth and Segmentation provide modest gains on simple motions but degrade on ramp dynamics, while Optical Flow shows highly variable and unstable behavior. In contrast, NewtonRewards delivers consistent positive improvements across all primitives, demonstrating robust generalization to diverse Newtonian dynamics.
  • Figure 5: Qualitative comparison of post-training strategies on the NewtonBench-60K ramp-slide-down scenario. Clear differences emerge when inspecting the temporal evolution across frames (left→right). For SFT and all PISA variants (Depth, Seg, Optical Flow), the cube exhibits inconsistent deceleration and unstable surface contact, evident in Frames 2–4, where the cube tilts unnaturally, slips erratically, or momentarily “floats” above the ramp. PISA Optical Flow in particular shows noticeable jitter and non-smooth frame-to-frame motion. In contrast, NewtonRewards maintains stable grounding and smooth, constant-acceleration motion across all frames.
  • ...and 7 more figures
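
The pipeline in Figure 3 combines a kinematic term and a mass conservation term. The sketch below shows one way such a combination could look, written as a penalty to minimize (lower is better). It rests on loud assumptions: the text here does not say how the mass term is computed, so this version compares per-frame appearance features of the generated clip against those of a reference clip. The names `kinematic_term`, `mass_term`, `newton_objective`, and the weight `lam` are hypothetical, and the feature arrays stand in for the output of a frozen encoder such as V-JEPA 2.

```python
import numpy as np


def kinematic_term(flow):
    """Mean squared second temporal difference of flow, shape (T, H, W, 2)."""
    flow = np.asarray(flow, dtype=np.float64)
    d2 = flow[2:] - 2.0 * flow[1:-1] + flow[:-2]
    return float(np.mean(np.sum(d2 ** 2, axis=-1)))


def mass_term(features, reference_features):
    """ASSUMED mass proxy: mean squared distance between per-frame features
    of the generated clip and a reference clip, both of shape (T, D)."""
    f = np.asarray(features, dtype=np.float64)
    r = np.asarray(reference_features, dtype=np.float64)
    if f.shape != r.shape:
        raise ValueError("feature arrays must have the same shape")
    return float(np.mean(np.sum((f - r) ** 2, axis=-1)))


def newton_objective(flow, features, reference_features, lam=1.0):
    """Hypothetical combined penalty: kinematic + lam * mass."""
    return kinematic_term(flow) + lam * mass_term(features, reference_features)


# A static clip has zero flow, so the kinematic term alone cannot penalize it.
static_flow = np.zeros((5, 2, 2, 2))
generated_feats = np.zeros((5, 3))   # placeholder encoder features
reference_feats = np.ones((5, 3))    # placeholder reference features
print(kinematic_term(static_flow))
print(newton_objective(static_flow, generated_feats, reference_feats, lam=2.0))
```

The toy case illustrates why a second term is needed: a motionless clip trivially satisfies the kinematic constraint, and only the feature-based term assigns it a nonzero penalty.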

Theorems & Definitions (1)

  • Proposition 1: Newtonian Kinematic Constraint
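
As a sanity check on the claim that all five primitives share constant-acceleration dynamics, the sketch below builds idealized world-frame velocity sequences for each primitive and verifies that their second temporal difference is numerically zero. The setup is assumed for illustration: point-mass motion in a 2D vertical plane, Coulomb kinetic friction on the ramp, and hypothetical default values for the ramp angle, friction coefficient, and frame rate. It works on physical velocities, not on optical flow extracted from video.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2


def primitive_acceleration(name, theta_deg=30.0, mu=0.2):
    """Constant acceleration vector (x, y) for an idealized primitive."""
    theta = np.deg2rad(theta_deg)
    down_ramp = np.array([np.cos(theta), -np.sin(theta)])  # unit vector
    if name in ("free_fall", "horizontal_throw", "parabolic_throw"):
        return np.array([0.0, -G])  # gravity only; the throws differ in v0
    if name == "ramp_slide_down":
        # Friction opposes the downhill motion (assumes tan(theta) > mu).
        return G * (np.sin(theta) - mu * np.cos(theta)) * down_ramp
    if name == "ramp_slide_up":
        # Gravity and friction both point downhill while the object moves up.
        return G * (np.sin(theta) + mu * np.cos(theta)) * down_ramp
    raise ValueError(f"unknown primitive: {name}")


def velocities(v0, accel, num_frames=8, dt=1.0 / 24.0):
    """Velocity at each frame under constant acceleration, shape (T, 2)."""
    t = np.arange(num_frames, dtype=np.float64)[:, None] * dt
    return np.asarray(v0, dtype=np.float64) + t * accel


def max_second_difference(v):
    """Largest absolute entry of v[t+1] - 2 v[t] + v[t-1]."""
    d2 = v[2:] - 2.0 * v[1:-1] + v[:-2]
    return float(np.max(np.abs(d2)))


PRIMITIVES = ["free_fall", "horizontal_throw", "parabolic_throw",
              "ramp_slide_down", "ramp_slide_up"]
for name in PRIMITIVES:
    v = velocities([1.0, 2.0], primitive_acceleration(name))
    print(name, max_second_difference(v))
```

Each printed residual is at the level of floating-point round-off, since every primitive differs only in its initial velocity and in the constant acceleration vector.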