GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning

Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, Shifeng Chen

TL;DR

GPT4Motion addresses the high computational cost and weak motion coherence of text-to-video generation by combining GPT-4-driven Blender scripting, Blender's physics simulation, and diffusion-based rendering. The workflow translates a user prompt into a Blender Python script that drives the physics engine, renders edge and depth maps, and feeds them to ControlNet-conditioned Stable Diffusion XL with a temporal consistency constraint to produce frames aligned with the prompt. A core contribution is a cross-frame attention (CFA) mechanism with a tunable weight $\alpha$ that blends information from the first frame and the current frame to improve temporal coherence and limit flicker. Across three basic physical motion scenarios, GPT4Motion achieves higher motion smoothness, better prompt alignment (measured with CLIP), and lower flickering than four baselines, and it was preferred in a 30-participant user study. These results point to a practical training-free path to physics-aware T2V generation and show how LLM-driven scripting can leverage conventional 3D engines for video synthesis.
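To make the role of $\alpha$ concrete, the following is a minimal sketch of one way such a blend could work. It assumes the blend is a convex combination of two attention outputs, one attending to the first frame's keys and values and one attending to the current frame's. This summary does not state the paper's exact formula, so the function names, the direction in which $\alpha$ weights the branches, and the pure-Python list representation are all illustrative assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def blended_cross_frame_attention(q_cur, k_cur, v_cur, k_first, v_first, alpha):
    """Hypothetical blend (an assumption, not the paper's stated formula):
    alpha weights the first-frame branch, (1 - alpha) the current-frame branch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    from_first = attention(q_cur, k_first, v_first)
    from_cur = attention(q_cur, k_cur, v_cur)
    return [[alpha * f + (1.0 - alpha) * c for f, c in zip(fr, cr)]
            for fr, cr in zip(from_first, from_cur)]


# Toy example: one query token, one key/value token per frame.
q = [[1.0, 0.0]]
k_first, v_first = [[1.0, 0.0]], [[2.0, 4.0]]
k_cur, v_cur = [[0.0, 1.0]], [[0.0, 0.0]]
blended = blended_cross_frame_attention(q, k_cur, v_cur, k_first, v_first, 0.5)
```

Under this assumed form, `alpha = 0` reduces to ordinary per-frame self-attention and `alpha = 1` makes every frame attend only to the first frame, which is one way to read the trade-off between temporal coherence and flicker described above.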

Abstract

Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. These components are then fed into Stable Diffusion to generate a video aligned with the textual prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can efficiently generate high-quality videos that maintain motion coherency and entity consistency. GPT4Motion offers new insights in text-to-video research, enhancing its quality and broadening its horizon for further explorations.

Paper Structure

This paper contains 37 sections, 3 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Comparison of the video results generated by different text-to-video models with the prompt "A basketball free falls in the air". Best viewed with Adobe Acrobat Reader (https://www.adobe.com/acrobat/pdf-reader.html) for animation.
  • Figure 2: The architecture of our GPT4Motion. First, the user prompt is inserted into our designed prompt template. Then, the Python script generated by GPT-4 drives the Blender physics engine to simulate the corresponding motion, producing sequences of edge maps and depth maps. Finally, two ControlNets are employed to constrain the physical motion of video frames generated by Stable Diffusion, where a temporal consistency constraint is designed to enforce the coherence among frames.
  • Figure 3: Our prompt template designed for GPT-4. It contains information about functions, external assets, and instruction. The user prompt is inserted into the placeholder "{PROMPT}".
  • Figure 4: GPT4Motion's results on basketball drop and collision. Best viewed with Adobe Acrobat Reader (https://www.adobe.com/acrobat/pdf-reader.html) for animation.
  • Figure 7: GPT4Motion's results on water pouring. Best viewed with Adobe Acrobat Reader (https://www.adobe.com/acrobat/pdf-reader.html) for animation.
  • ...and 4 more figures
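
The Figure 3 caption describes a prompt template with a "{PROMPT}" placeholder. A minimal sketch of that substitution step is shown here. The template text is a hypothetical stand-in (the actual wording appears only in the paper's Figure 3), and the name `fill_template` is an assumption made for illustration.

```python
PLACEHOLDER = "{PROMPT}"

# Hypothetical template text. Only the three section topics (functions,
# external assets, instruction) and the placeholder come from the caption.
TEMPLATE = """\
Functions:
(descriptions of the available Blender helper functions would go here)

External assets:
(descriptions of the available assets would go here)

Instruction:
Write a Blender Python script for the following prompt: {PROMPT}
"""


def fill_template(template: str, user_prompt: str) -> str:
    """Insert the user prompt into the template's single placeholder."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain exactly one {PROMPT} placeholder")
    # str.replace is used instead of str.format so that any other braces in
    # the template (for example, in embedded code snippets) are left untouched.
    return template.replace(PLACEHOLDER, user_prompt.strip())


filled = fill_template(TEMPLATE, "A basketball free falls in the air")
```

The filled string would then be sent to GPT-4, whose reply is expected to be the Blender script. That call is omitted here because the summary does not describe it.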