SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios

Yanan Wang, Renxi Wang, Yongxin Wang, Xuezhi Liang, Fajri Koto, Timothy Baldwin, Xiaodan Liang, Haonan Li

TL;DR

SimuScene tackles the challenge of teaching and evaluating LLMs to generate executable code that simulates physical dynamics from natural language. It introduces a large, automatically constructed yet quality-controlled dataset spanning five physics domains and 52 concepts, plus a 334-example human-verified test set and a Code-Video-Judge RL framework that uses vision-based video verification to train text-only models. Frontier LLMs show limited end-to-end performance (best Avg@8 around 21.5%), but training with SFT and RLVR substantially boosts end-to-end alignment, achieving 34.4% for a 7B model and 72.2% for a 32B model on Pass@8. The work demonstrates the value of vision-grounded rewards and data mixing with general reasoning data to enhance code-based physical simulations, with potential educational and research implications in physics visualization and automated teaching tools.
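The TL;DR quotes both Avg@8 and Pass@8, which this summary does not define. The following is a minimal sketch, assuming the common convention that each test scenario gets 8 sampled programs, Avg@8 is the mean per-sample success rate, and Pass@8 is the fraction of scenarios with at least one successful sample. The function names and the outcome matrix are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def avg_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Mean per-sample success rate, averaged over scenarios (assumed Avg@k)."""
    return sum(sum(r) / len(r) for r in results) / len(results)


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of scenarios with at least one successful sample (assumed Pass@k)."""
    return sum(any(r) for r in results) / len(results)


# Hypothetical outcomes: 4 scenarios, 8 sampled programs each.
results = [
    [True] + [False] * 7,      # 1 of 8 samples succeeds
    [False] * 8,               # no sample succeeds
    [True] * 4 + [False] * 4,  # 4 of 8 samples succeed
    [False] * 6 + [True] * 2,  # 2 of 8 samples succeed
]

print(avg_at_k(results))   # 0.21875
print(pass_at_k(results))  # 0.75
```

Under these assumed definitions Pass@8 is never lower than Avg@8 on the same samples, so the two kinds of figures quoted above are not directly comparable.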

Abstract

Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning, yet their ability to accurately represent and simulate physical scenarios via code remains underexplored. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios across five physics domains and 52 physical concepts. We build an automatic pipeline to collect data, with human verification to ensure quality. The final dataset contains 7,659 physical scenarios with 334 human-verified examples as the test set. We evaluated 10 contemporary LLMs and found that even the strongest model achieves only a 21.5% pass rate, demonstrating the difficulty of the task. Finally, we introduce a reinforcement learning pipeline with visual rewards that uses a vision-language model as a judge to train textual models. Experiments show that training with our data improves physical simulation via code while substantially enhancing general code generation performance.
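The abstract describes a reinforcement learning pipeline in which a vision-language model judges rendered videos to produce rewards for a text-only model. The sketch below illustrates one way such a reward could be wired up. It is an assumption and not the paper's implementation: the reward is taken to be the fraction of verification questions the judge answers affirmatively, and `visual_reward`, `render`, `judge`, and the stub functions are all hypothetical names introduced for illustration.

```python
from typing import Callable, Optional, Sequence

Video = bytes  # placeholder type for a rendered video (assumption)


def visual_reward(
    code: str,
    questions: Sequence[str],
    render: Callable[[str], Optional[Video]],
    judge: Callable[[Video, str], bool],
) -> float:
    """Fraction of verification questions the judge confirms for the video."""
    video = render(code)  # execute the generated code and capture a video
    if video is None or not questions:
        return 0.0  # code that fails to render earns no reward
    passed = sum(1 for q in questions if judge(video, q))
    return passed / len(questions)


# Stand-in stubs so the sketch runs without a renderer or a real VLM.
def fake_render(code: str) -> Optional[Video]:
    return None if "raise" in code else code.encode()


def fake_judge(video: Video, question: str) -> bool:
    return question.encode() in video


questions = ["ball", "bounce"]
print(visual_reward("draw ball; bounce", questions, fake_render, fake_judge))  # 1.0
print(visual_reward("draw ball", questions, fake_render, fake_judge))          # 0.5
print(visual_reward("raise Error", questions, fake_render, fake_judge))        # 0.0
```

Passing the renderer and judge as callables keeps the policy model text-only: it only ever sees a scalar reward, while the visual checking happens outside it.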

Paper Structure (45 sections, 8 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Representative text-to-code-to-video examples from the SimuScene benchmark. Each video is rendered from LLM-generated code and illustrates the described dynamic physical process.
  • Figure 2: Overview of the SimuScene dataset construction process, including dynamic scenario and visual question generation, as well as scenario–reasoning trace–code consistency assessment.
  • Figure 3: Examples from SimuScene. A corresponding video is shown as the second entry in \ref{fig:data_example}.
  • Figure 4: The reward curve during RL training. All four rewards yield steadily improving signals.
  • Figure 5: Distributions of topics, scenario description token counts, and VLM verification questions in the SimuScene dataset.
  • ...and 6 more figures