PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

Shang Wu, Chenwei Xu, Zhuofan Xia, Weijian Li, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

TL;DR

This work presents PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation, and establishes that domain-specialized reinforcement learning with compositional curricula surpasses general-purpose scaling for physics-aware generation.

Abstract

State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles like object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization: PhyPrompt-7B reaches 40.8% joint success on VideoPhy2 (8.6pp gain), improving physical commonsense by 11pp (55.8% to 66.8%) while simultaneously increasing semantic adherence by 4.4pp (43.4% to 47.8%). Remarkably, our curriculum exceeds single-objective training on both metrics, demonstrating compositional prompt discovery beyond conventional multi-objective trade-offs. PhyPrompt outperforms GPT-4o (+3.8% joint) and DeepSeek-V3 (+2.2%, 100× larger) using only 7B parameters. The approach transfers zero-shot across diverse T2V architectures (Lavie, VideoCrafter2, CogVideoX-5B) with up to 16.8% improvement, establishing that domain-specialized reinforcement learning with compositional curricula surpasses general-purpose scaling for physics-aware generation.
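The dynamic reward curriculum described in the abstract can be illustrated with a minimal sketch. The abstract only states that the reward first prioritizes semantic fidelity and then shifts toward physical commonsense; the linear schedule, the endpoint weights (0.2 and 0.8), and the function names `curriculum_weights` and `combined_reward` below are assumptions chosen for illustration, not the paper's actual formulation.

```python
def curriculum_weights(step: int, total_steps: int) -> tuple[float, float]:
    """Return (w_sa, w_pc) for a training step.

    Assumption: a linear schedule that moves weight from semantic
    adherence (SA) to physical commonsense (PC). The endpoints 0.2 and
    0.8 are hypothetical values, not taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1] so out-of-range steps are safe.
    progress = min(max(step / total_steps, 0.0), 1.0)
    w_pc = 0.2 + 0.6 * progress  # PC weight grows as training proceeds
    w_sa = 1.0 - w_pc            # SA weight shrinks correspondingly
    return w_sa, w_pc


def combined_reward(sa: float, pc: float, step: int, total_steps: int) -> float:
    """Weighted sum of SA and PC scores under the assumed schedule."""
    w_sa, w_pc = curriculum_weights(step, total_steps)
    return w_sa * sa + w_pc * pc
```

Under this assumed schedule, a video with a high SA score and a mediocre PC score earns more reward early in training than late, so the policy is pushed first to preserve user intent and later to add physical detail.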

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Video Generated by CogVideoX-5B Using Different Prompts. We compare the original prompt "A wine is poured from a bottle in to a glass" with three enhanced versions: manual rewrite, GPT-4o, and PhyPrompt. Semantic Adherence (SA) and Physical Commonsense (PC) scores are shown for each. The original prompt fails to depict rising liquid levels (PC:3). Manual rewriting with "rises steadily" achieves perfect scores (SA:5, PC:5). GPT-4o emphasizes the "visible stream of wine" but slightly reduces semantic alignment (SA:4, PC:5). PhyPrompt automatically generates "slowly and smoothly," matching manual rewrite quality (SA:5, PC:5) without human intervention.
  • Figure 2: PhyPrompt GRPO Pipeline. For each input prompt, the LLM (Qwen2.5 [Qwen2024]) generates multiple enhanced prompts, and for each enhanced prompt the T2V generator (CogVideoX-2B [yang2024cogvideox]) generates one video. A video evaluator (VideoPhy2-Auto [bansal2025videophy2]) scores each video. We use GRPO to optimize the LLM.
  • Figure 3: Cross-Generator Transfer of PhyPrompt. Performance comparison across different prompt enhancement methods on three text-to-video generation models (Lavie, CogVideoX-5B, and VideoCrafter2), measured by the joint Physical Commonsense and Semantic Adherence (PC & SA) metric. Our method consistently outperforms baseline approaches (Original, Promptist, PhyT2V) and similarly sized Qwen2.5 models across all three video generation backbones. The highest performance (42.0%) is achieved by our 7B model on CogVideoX-5B. Gold stars indicate the best-performing method for each generator.
  • Figure 4: Training reward curves under the dynamic (orange) and static (purple) reward formulations. Each line is smoothed with a 10-step moving average; lightly shaded traces show the unsmoothed per-episode rewards. The dynamic curriculum converges more rapidly and plateaus higher (4.2) than the static baseline (3.5), indicating more effective optimization.
  • Figure 5: Frame‐by‐frame comparison of CogVideoX-2B outputs for the “hammer and nail” scenario. Top row shows results using the static‐reward prompt “Show a heavy hammer striking a wooden plank with precision, ensuring that the impact creates a realistic sound effect and deformation of the wood surface.” The hammer impacts the plank but no nail is visible or driven. Bottom row shows results using the dynamic‐reward prompt “Heavy hammer hits the nail into the wood plank. The hammer’s force is concentrated, causing high pressure that drives the nail deep into the plank.” Here the nail appears in frame 1, is pressed gradually into the wood across frames 2–3, and achieves full penetration by frame 4, with realistic wood deformation and depth of entry.
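
The pipeline in Figure 2 scores a group of enhanced prompts per input and optimizes the LLM with GRPO. A minimal sketch of the group-relative advantage step follows, using the standard GRPO normalization (each reward minus the group mean, divided by the group standard deviation). The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration; the paper's exact formulation is not given in this section.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt group into advantages.

    Each entry of `rewards` is the evaluator score for the video
    generated from one enhanced prompt in the same group. The advantage
    is (reward - group mean) / (group std + eps). Using the population
    standard deviation and eps are assumptions of this sketch.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # zero when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Enhanced prompts whose videos score above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages. When every prompt in a group scores the same, all advantages are zero and the group contributes no learning signal.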