SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios

Yanan Wang; Renxi Wang; Yongxin Wang; Xuezhi Liang; Fajri Koto; Timothy Baldwin; Xiaodan Liang; Haonan Li

TL;DR

SimuScene trains and evaluates LLMs on generating executable code that simulates physical dynamics from natural-language descriptions. It introduces a large, automatically constructed and quality-controlled dataset spanning five physics domains and 52 concepts, a 334-example human-verified test set, and a Code-Video-Judge RL framework that uses vision-based video verification to train text-only models. Frontier LLMs show limited end-to-end performance (best Avg@8 around 21.5%). Training with SFT and RLVR substantially improves end-to-end alignment, reaching 34.4% Pass@8 for a 7B model and 72.2% Pass@8 for a 32B model. The work shows that vision-grounded rewards, combined with mixing in general reasoning data, improve code-based physical simulation, with potential applications in physics visualization and automated teaching tools.
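The two metrics named above can be illustrated with a minimal sketch. It assumes the common reading of these names: Avg@k is the mean pass rate over k sampled generations per scenario, and Pass@k is the fraction of scenarios where at least one of the k generations passes. The paper's exact definitions are not given in this summary, so the function names `avg_at_k` and `pass_at_k` and the toy outcomes below are hypothetical.

```python
from typing import Sequence


def avg_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Mean per-sample pass rate, averaged over scenarios.

    Each inner sequence holds the pass/fail outcomes of the k
    generations sampled for one scenario (assumed definition).
    """
    per_scenario = [sum(r) / len(r) for r in results]
    return sum(per_scenario) / len(per_scenario)


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of scenarios with at least one passing generation
    among the k samples (assumed definition)."""
    return sum(any(r) for r in results) / len(results)


# Toy data: 4 scenarios, k = 8 samples each (made up for illustration).
outcomes = [
    [True, False, False, False, False, False, False, False],
    [False] * 8,
    [True, True, False, False, True, False, False, True],
    [False] * 8,
]

avg8 = avg_at_k(outcomes)
pass8 = pass_at_k(outcomes)
print(f"Avg@8 = {avg8:.4f}, Pass@8 = {pass8:.4f}")
```

Under this reading Pass@k is never lower than Avg@k, so the Avg@8 and Pass@8 figures quoted above are not directly comparable with each other.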

Abstract

Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning, yet their ability to accurately represent and simulate physical scenarios via code remains underexplored. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios across five physics domains and 52 physical concepts. We build an automatic pipeline to collect data, with human verification to ensure quality. The final dataset contains 7,659 physical scenarios with 334 human-verified examples as the test set. We evaluate 10 contemporary LLMs and find that even the strongest model achieves only a 21.5% pass rate, demonstrating the difficulty of the task. Finally, we introduce a reinforcement learning pipeline with visual rewards that uses a vision-language model as a judge to train textual models. Experiments show that training with our data improves physical simulation via code while substantially enhancing general code generation performance.
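The idea of a visual reward from a vision-language judge can be sketched as follows. This is only an illustration under stated assumptions, not the paper's reward design: it assumes the generated code has already been rendered to a video, that each scenario comes with yes/no verification questions, and that the reward is the fraction of questions the judge answers "yes". The names `visual_reward`, `Judge`, and `stub_judge` are hypothetical, and the stub stands in for a real VLM call.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: (video_path, question) -> True if the
# judge decides the rendered video satisfies the question.
Judge = Callable[[str, str], bool]


def visual_reward(video_path: str, questions: Sequence[str], judge: Judge) -> float:
    """Fraction of verification questions the judge answers 'yes'.

    Returns 0.0 when there are no questions, so an unverifiable
    rollout earns no reward (an assumption of this sketch).
    """
    if not questions:
        return 0.0
    passed = sum(1 for q in questions if judge(video_path, q))
    return passed / len(questions)


def stub_judge(video_path: str, question: str) -> bool:
    """Placeholder for a VLM call; says 'yes' only to bounce questions."""
    return "bounce" in question


questions = [
    "Does the ball bounce off the floor?",
    "Does the ball slow down over time?",
]
reward = visual_reward("rollout_0.mp4", questions, stub_judge)
print(reward)
```

Because the reward depends only on the rendered video and the questions, the policy being trained can remain a text-only model while the judge supplies the visual signal.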

Paper Structure (45 sections, 8 equations, 11 figures, 8 tables)

Introduction
Related Work
Reasoning Evaluation for LLMs
Physical Understanding and Reasoning with LLMs
Reinforcement Learning from Vision Signal
SimuScene
Dataset Construction
Domain Selection
Scenario & Verification Question Generation
Quality Improving
Dataset Statistics & Splits
Dataset Quality Control
Evaluation Pipeline
Evaluation of Frontier LLMs
Metrics
...and 30 more sections

Figures (11)

Figure 1: Representative text-to-code-to-video examples from the SimuScene benchmark. Each video is rendered from LLM-generated code and illustrates the described dynamic physical process.
Figure 2: Overview of the SimuScene dataset construction process, including dynamic scenario and visual question generation, as well as scenario–reasoning trace–code consistency assessment.
Figure 3: Examples from SimuScene. A corresponding video is shown as the second entry in \ref{fig:data_example}.
Figure 4: The reward curve during RL training. All four rewards yield steadily improving signals.
Figure 5: Distributions of topics, scenario description token counts, and VLM verification questions in the SimuScene dataset.
...and 6 more figures
